Abstract —Developing a reliable and practical face recognition 
system is a long-standing goal in computer vision research. 
Existing literature suggests that pixel-wise face alignment is the 
key to achieving high-accuracy face recognition. By modeling a 
human face as piece-wise planar surfaces, where each surface 
corresponds to a facial part, we develop in this paper a Constrained 
Part-based Alignment (CPA) algorithm for face recognition 
across pose and/or expression. Our proposed algorithm is based 
on a trainable CPA model, which learns appearance evidence 
of individual parts and a tree-structured shape configuration 
among different parts. Given a probe face, CPA simultaneously 
aligns all its parts by fitting them to the appearance evidence 
with consideration of the constraint from the tree-structured 
shape configuration. This objective is formulated as a norm 
minimization problem regularized by graph likelihoods. CPA can 
be easily integrated with many existing classifiers to perform 
part-based face recognition. Extensive experiments on benchmark face 
datasets show that CPA outperforms or is on par with existing 
methods for robust face recognition across pose, expression, 
and/or illumination changes. 


I. Introduction 

Developing a reliable and practical face recognition system is 
a long-standing goal in computer vision research. A tremendous 
amount of work has been done in the past three decades; 
however, most existing methods work well only in controlled 
scenarios. In more practical scenarios, the performance of these 
methods degrades drastically due to face variations 
caused by illumination, pose, and/or expression changes [1]. 

To handle the variation caused by illumination change, the 
methods of Lee et al. [2] and Wagner et al. [3], motivated by 
the illumination cone model [4, 5], used multiple carefully 
chosen face images of varying illuminations per subject as 
gallery. Given a probe face of an unknown subject under arbitrary 
illumination, face images under this specific illumination can 
be generated in the gallery to match with the probe face. 
When multiple gallery images of this kind are not available, 
alternative methods [6, 7] consider extracting illumination-invariant 
features for face recognition. 

To address pose or expression variations, earlier approaches 
extend classic subspace or template based face recognition 
methods [8, 9]. The use of these methods is rather restricted 
due to their dependence on the availability of gallery images 
of multiple facial poses or expressions. Recent approaches 
consider more practical scenarios where only face images under 
the normal condition (frontal view and neutral expression) are 
assumed to be available in the gallery. To recognize a probe 
face with pose or expression changes, they either identify an 
implicit identity feature/representation of the probe face [10, 
11, 12], which is pose- and expression-invariant, or explicitly 
estimate global or local mappings of facial appearance so that 
a virtual face under the normal condition can be synthesized 
for recognition [13, 14, 15, 16, 17]. However, most of these 
methods are still far from practice since they assume both 
gallery and probe face images have been manually aligned 
into some canonical form, and they cannot generally cope with 
illumination variation either. 

In the literature, the most popular approaches for automatic 
face alignment across pose or expression are based on facial 
landmark localization, e.g., the Active Appearance Models 
(AAMs) [18, 19] and elastic graph matching (EGM) [20]. 
AAMs and EGM use densely-connected elastic graphs, which, 
however, are difficult to optimize because the solutions are likely 
to be trapped in undesirable local minima. Consequently, 
localized landmarks using these methods are often not accurate 
enough, especially when applied to unseen face images. To 
improve the localization accuracy, an explicit shape constraint 
for graph nodes was considered in the constrained local 
models (CLMs) [21]. CLMs [21, 22, 23] are still based on 
densely-connected graph models, and their shape constraints 
are over-simplified so that the dependency among different 
graph nodes is ignored. Recently, deformable part-based models 
(DPMs) have shown promise in many applications such as 
object detection [24] and facial landmark localization [25]. In 
particular, Zhu and Ramanan [25] adopted a tree-structured 
part model for facial landmark localization, which encodes 
node dependency while admitting efficient solutions. Nevertheless, it 
cannot be readily extended for pixel-wise face alignment, and 
consequently for face recognition across pose or expression. 

With the aim of developing a system that can simultaneously 
handle illumination variation and minor changes of pose 
and expression, Wagner et al. [3] recently leveraged sparsity 
optimization and a carefully prepared gallery set (multiple 
images of varying illuminations per subject), to align probe 
face images into a canonical form for better recognition. In 
particular, they modeled the human face as a planar surface and 
chose a global similarity transformation for face alignment. 
This may be valid when probe faces are close to the normal 
condition. It is, however, an oversimplified assumption when 
there exist pose and/or expression changes. Indeed, a human 
face has non-planar geometry and non-rigid deformation. Some 
portions of the face (e.g., the nose region) undergo significant 
appearance changes as the pose varies, while other portions (e.g. 
the mouth region) deform significantly as expression changes. 
It is thus more appropriate to approximate a human face as a 
piece-wise planar surface, and correspondingly use piece-wise 
geometric transformation for face alignment. 

Improving performance using enriched models is not an easy 
task in computer vision research. In fact, the opposite effect often 
occurs. For the problems considered in this paper, we will 
show that by partitioning a human face into a collection of parts, 
and carefully characterizing the deformation relations among 
different parts, the performance of face recognition across pose 
or expression can be significantly improved. In particular, we 
propose a Constrained Part-based Alignment (CPA) method for 
this task. Our method is partially motivated by the promise of 
piece-wise planar formulation and the explicit shape constraints 
used in CLMs [21, 22, 23] and DPMs [24, 25]. 

CPA partitions the object of interest, e.g., a human face, as 
a constellation of parts, and uses a similarity transformation 
to model the deformation of each part. A tree-structured shape 
model is used in CPA to constrain the relations among the 
deformations of different parts. A CPA model also has a 
batch of registered face images serving as the appearance 
evidence of each part. For a probe face image, all its parts 
are simultaneously aligned towards the registered model, by 
fitting them to the appearance evidence and penalizing the 
cost of violating the constraint from the tree-structured shape 
model. We formulate this objective as a norm minimization 
problem regularized by graph likelihoods, which is solved 
by an alternating method composed of two steps: one for 
aligning the parts, and the other for adjusting the configuration 
of the tree-structured model. The former can be reduced to a 
sequence of convex problems, while the latter admits efficient 
solutions by gradient descent methods. In this paper, we use the 
proposed CPA model for face recognition, where registered 
gallery images are taken as the appearance evidence, and a 
probe image is aligned using the CPA model. After alignment, 
most off-the-shelf face recognition methods can be 
readily used for part recognition. The overall decision is made 
by aggregating predictions from different parts by a plurality 
voting scheme. Robustness against illumination variation can 
also be achieved by choosing specific face recognition methods, 
e.g., [3, 26]. 

Richer models are often more difficult to train. For the 
proposed CPA model, both the appearance model and the tree- 
structured shape model need to be automatically learned. On 
the one hand, the appearance model is obtained by aligning 
the gallery images in batch with the constraint from a given 
shape model. In particular, we use low-rank and sparse matrix 
decomposition as the criterion to optimize batch alignment of 
each part, and globally regularize these individual problems of 
part alignment by graph likelihoods. On the other hand, the 
shape model is learned by the probabilistic inference based 
on given part constellations, where the maximum a posteriori 
(MAP) estimation is performed with the graph likelihoods 
and the given conjugate priors of the likelihoods. As the 
two problems are coupled, we solve them in a joint way. 
Moreover, we generalize our learning method to train a mixture 
of CPA models (mCPA) for better handling pose and expression 
variations. 

Part-based face recognition methods [27, 28, 29, 30, 31] have 
shown improved performance over more standard approaches 
using holistic faces. However, existing part-based methods can 
only be applied when gallery and probe face images have been 
manually registered to each other. Our proposed CPA method 
enables this promising part-based strategy to be applicable 
in more practical scenarios, by automatically registering the 
gallery and probe face images in terms of deformable parts. In 
this paper, we present experiments on the Multi-PIE [32] and 
MUCT [33] datasets and show that our proposed CPA method 
can simultaneously and effectively handle illumination, pose, 
and/or expression variations. 

• Compared with the natural alternative methods that 
holistically align face images and then apply either holistic 
face recognition or part-based recognition, our method 
gives significantly improved performance. 

• The state-of-the-art pose-invariant face recognition 
method [34] relies on model learning using a large 
number of 3D face shapes, while training of our proposed 
CPA only requires 2D face images of a few subjects. 
Nevertheless, our method outperforms that of Li et al. 
[34] when the degrees of pose change are within ±15°. 
This range of pose change is often encountered in 
practical access control scenarios, where test subjects 
are cooperative with face recognition systems. 

• Notably, while CPA is motivated by the challenges 
of face recognition across pose/expression, it performs 
surprisingly well when probe face images are at frontal 
view and with neutral expression. Our results in the frontal-view, 
neutral-expression, varying-illumination, and across-session 
setting of the Multi-PIE dataset are better than all 
existing methods, and as high as 99.6%. This confirms that 
considering the human face as a piece-wise planar surface and 
aligning it part-wise are very effective for high-accuracy 
face recognition. 

• We also empirically investigate the discriminative power 
of individual facial parts, and the robustness of our method 
against partial occlusion. 

Details of these investigations are presented in Section VI. 

The rest of this paper is organized as follows. Section II 
reviews related work in addition to that discussed in 
Section I. Section III presents our proposed CPA model and 
how to use it for alignment of probe face images. Section IV 
combines CPA with existing methods for part-based face 
recognition. Details of CPA model learning are presented in 
Section V, where we also extend CPA to a mixture of CPA models 
for better handling pose or expression variations. Extensive 
experiments are finally reported in Section VI to show the 
efficacy of our proposed method. 

Figure 1: Illustration of a CPA model instance with 14 facial parts. The face is considered as a constellation of parts constrained 
by a tree structure, where each part corresponds to a pre-defined domain (i.e., shape and location) (Sub-figure i). For an image, 
deformations are computed (Sub-figure ii) for transforming the parts to the domains (Sub-figure iii) where they are aligned to 
the appearance evidence given by the linear combination (Sub-figure iv) of existing registered images (Sub-figure v). 

II. Related Work 

As discussed in Section I, most existing methods for 
face recognition across pose or expression assume that gallery and 
probe face images have been manually registered, and they can 
only be applied to controlled scenarios. In particular, Chai et al. 
[13] learned local patch-based mapping relations from non-frontal 
pose to frontal pose by locally linear regression. Jorstad 
et al. [15] estimated pixel-wise registration from non-neutral 
expression to neutral expression by optical flow. Arashloo 
and Kittler [ 16 ] considered a Markov random field (MRF) 
model to regularize 2D displacements of local patches across 
different poses. For automatic face recognition across pose, 
given a probe face image, AAMs [18, 19] optimized localization 
of a set of facial landmarks to realize face alignment of 
pixel-level accuracy. However, the landmarks localized by 
AAMs are usually not accurate enough, and the pixel-wise 
correspondence induced by matched landmarks is not 
consistent enough across different poses and expressions 
[35]. Among existing methods, perhaps the most successful 
ones across pose and/or expression are based on 3D models 
[36, 35, 34]. In spite of their promise, their reliance on 3D 
data makes them less relevant to the 2D techniques considered 
in this paper. However, we will show that our proposed CPA 
method compares favorably with them when the pose variations 
are in a reasonably confined range. 

Deformable part models (DPM) have shown their success 
in object detection [ 24 ], facial landmark localization [ 25 ], 
and human pose estimation [ 37 ]. In particular, these methods 
optimize an objective function that scores both the appearance 
evidence and spatial constellation of the parts, where the former 
is scored by part detectors, and the latter is scored by a star- 
or tree-structured shape model, which constrains the pair-wise 
offsets of different parts. Our use of part-based models is 
different from the DPM-based methods [24, 25, 37]. Our 
CPA method aims at pixel-wise image alignment rather than 
finding the bounding box of an object/part or locating a small 
number of landmarks. We measure the appearance evidence of 
parts by their similarity to aligned galleries, and constrain the 
part deformation relations by a tree-structured shape model. 
Compared with shape constraints in DPMs [ 25 , 37 ], our 
proposed one models more complex relations with the part 
constellation and holds strict probabilistic properties that are 
beneficial to the model learning. 

Our method is also closely related to the work of Wagner et al. 
[ 3 ]. To cope with illumination variation, they used multiple 
gallery images of varying illuminations per subject. Our method 
also follows this strategy. However, we take a human face 
as a collection of parts rather than assuming it as a planar 
surface as in the method of Wagner et al. [ 3 ]. The piece-wise 
planar assumption used in our CPA model can handle much 
more complex facial appearance variations such as large pose 
and/or facial expression changes. Furthermore, we regularize 
the deformations of individual parts by a shape constraint in 
order to prevent the alignment from reaching degenerate solutions. 
Such a regularization term is absent in the method of Wagner 
et al. [3]. 

III. CPA: Constrained Part-based Alignment 

Assume we have face images of multiple subjects in a 
database, and these images have been registered into some 
canonical form, i.e., aligned to some template. The template 
we consider in this paper is composed of a collection of parts. 
These parts of varying sizes are spatially arranged in a proper 
manner so that overall they define our face template (Fig. 1-i). 
Suppose there are m parts in the template. Each of the m parts 
has its associated face region in each image in the database. 
We call the face regions associated with the i-th part a part 
dictionary, denoted as $D^{(i)}$ (Fig. 1-v). Let y be a test 
face image that is not generally in the canonical form (e.g., due 
to pose change or misalignment). As the name of CPA suggests, 
constrained part-based alignment aims to align y to the face 
template so that different face regions of y are respectively 
registered with the corresponding parts of the face template. 
This part-based alignment can be realized by pursuing a set 
of transformations $\nu = [\nu^{(1)}, \nu^{(2)}, \ldots, \nu^{(m)}]$ that act on the 
image domain of y by $y \circ \nu^{(i)}$, i = 1, ..., m.¹ 

(b) Independent part alignment by the method of Wagner et al. [3] at Iteration 17 
(c) Our CPA method at the converged stage 

Figure 2: Part alignment obtained without/with a shape constraint. Parts are initialized at dashed red boxes, and aligned to the 
appearance evidence suggested by the part dictionaries. The solid green boxes denote the obtained domains of the parts, y is 
the image cropped from an obtained domain, D · x is its best linear reconstruction with the given part dictionary, and e is the 
reconstruction error. The cropped images are normalized in terms of intensity for display convenience. 

To pursue $\nu$, we consider techniques used in [26, 3]. In 
particular, Wright et al. [26] and Wagner et al. [3] assumed that there 
exist multiple registered face images of varying illuminations 
per subject in the database. A well-aligned test image can then 
be represented by a linear combination of registered training 
images, plus a sparse error term to compensate for data 
corruption or various intra-subject variations. By leveraging the 
error sparsity assumption, Wagner et al. [3] optimized a holistic 
similarity transformation by solving an $\ell_1$-norm minimization 
problem. Extending the techniques in [3] directly to part-based 
alignment gives the following objective 

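$$\min_{x^{(i)},\, e^{(i)},\, \nu^{(i)},\; i=1,2,\ldots,m} \; \sum_{i=1}^{m} \lambda^{(i)} \|e^{(i)}\|_1 \quad \text{s.t.} \quad y \circ \nu^{(i)} = D^{(i)} x^{(i)} + e^{(i)}, \tag{2}$$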

where $\|\cdot\|_1$ is the $\ell_1$-norm that encourages the sparsity of errors 
$\{e^{(i)}\}_{i=1}^m$, $y \circ \nu^{(i)}$ aligns the i-th part of y, and $\{\lambda^{(i)} \in \mathbb{R}_+\}_{i=1}^m$ 
balance the alignment errors of the m parts. The matrix $D^{(i)}$ 
is the i-th part dictionary, whose columns represent the aligned 
regions of face images in the database, and $x^{(i)}$ denotes the 
reconstruction coefficient. 

The above direct extension of [3] essentially aligns the m 
parts independently. Unfortunately, it empirically appears to be 

¹ Let {(u, v)} denote the image coordinates of y, and let $\nu^{(i)}$ be a 2-D coordinate 
transformation function. Then, image transformation is realized by 
$y \circ \nu^{(i)} = y(\nu^{(i)}(u, v))$, and we also use $y \circ \nu^{(i)}$ to denote the 
transformed image (the i-th part). The transformations $\{\nu^{(i)}\}_{i=1}^m$ belong to a d-dimensional 
transformation group G, e.g., the similarity group or the affine group. In this 
paper, we assume G to be the 2-D similarity group and parameterize the 
similarity transformation mapping (u, v) to (u', v') as $(t_u, t_v, s, \theta)^T \in \mathbb{R}^4$ 
that satisfies 



With a slight abuse of notation, we simultaneously take $\nu^{(i)}$ as a transform 
function and as a column vector composed of the d parameters of G (d = 4). 
Accordingly, $\nu \in \mathbb{R}^{d \times m}$. 




very unstable for facial part alignment. As shown in Fig. 2b, the 
parts may often drift away from their initialization to some more 
“flat” face regions. Indeed, alignment by optimizing $\nu^{(i)}$ in 
$y \circ \nu^{(i)}$ is a non-convex procedure (as explained in more detail 
in Section III-B1). Compared with an entire face, individual 
parts contain less visual structure, and thus their alignment is 
prone to meaningless local minima. To overcome this problem, 
we incorporate in this paper global 
information about facial structure to regularize the deformations 
of individual parts. In particular, we are motivated by [25] to 
use a tree-structured shape model to constrain the difference of 
transformation parameters of different parts. We write $g(\nu, Z)$ 
for the regularization term determined by the tree-structured 
shape model, where Z denotes model parameters in the form 
of a tuple. The tree-structured shape model is illustrated in 
Fig. 1 (i)-(iii) and will be elaborated in Section III-A. With 
consideration of the tree-structured shape constraint, we revise 
(2) to formulate the alignment objective of our proposed CPA 
model as 

$$\min_{x^{(i)},\, e^{(i)},\, \nu^{(i)},\; i=1,2,\ldots,m} \; \sum_{i=1}^{m} \lambda^{(i)} \|e^{(i)}\|_1 + \eta\, g(\nu, Z) \quad \text{s.t.} \quad y \circ \nu^{(i)} = D^{(i)} x^{(i)} + e^{(i)}, \tag{3}$$

where $\eta \in \mathbb{R}_+$ weights the regularization term. The choice of 
$\{\lambda^{(i)}\}_{i=1}^m$ will be addressed when we learn the CPA model in 
Section V. As shown in Fig. 2c, our proposed CPA method 
can produce very stable alignment results using the same part 
dictionaries (Fig. 2a) as independent part alignment does. In 
Fig. 3, we give all the part instances of the CPA model that are 
fitted to real face images. We note that CPA is not designed 
to produce a seamless face constituted by deformed individual 
parts, as AAMs [18, 19] can do. Instead, it aims to establish 
correspondence of facial parts across pose or expression so 
that face images with intra-subject variations can be matched 
at the part level to improve face recognition. 

A. Tree-structured Shape Model 

Let $(\mathcal{V}, \mathcal{E})$ denote the tree, where $\mathcal{V} = \{0, 1, 2, \ldots, m\}$ is the 
vertex set, and $\mathcal{E}$ is the set of directed edges. $\mathcal{V}$ is composed of 
the nodes of the m facial parts and a root node (the node “0”) 
corresponding to the holistic face. Every i-th node takes $\nu^{(i)}$ as 
its variables, corresponding to the transformation parameters of 
the facial part. For the root node, we use $\nu^{(0)} = (0, 0, 0, 0)^T$ 
when assuming the face at the global level has been registered 
into the canonical form, which suggests $(u', v') = (u, v)$. 
We consider this simplified case in the following derivation of 
the tree-structured shape model. 

Figure 3: CPA model instances fitted to face images. Each 
facial part is denoted by a box in a specific color. The circled 
disk at its center denotes the associated tree node. The white 
solid lines denote the tree edges. Bigger nodes correspond to 
higher levels of the tree. The nodes with black dots are linked 
with the root node, which is not displayed in the figure. 

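In order to build a model on the tree, we first associate 
with each edge the difference of transformation parameters of 
its two end nodes. We then concretize $g(\nu, Z)$ as follows. 
In particular, for the edge $(j, i) \in \mathcal{E}$ with parent node j 
and child node i, we use a multivariate Gaussian distribution 
with mean $\mu^{(i)} \in \mathbb{R}^d$ and precision $\Lambda^{(i)} \in \mathbb{R}^{d \times d}$ to model 
the transformation difference $\nu_j^{(i)} = \nu^{(i)} - \nu^{(j)} \in \mathbb{R}^d$, i.e., 
$p(\nu_j^{(i)} | z^{(i)}) = \mathcal{N}(\nu_j^{(i)} | \mu^{(i)}, (\Lambda^{(i)})^{-1})$ with $z^{(i)} = (\mu^{(i)}, \Lambda^{(i)})$, 
where the Gaussian probability density function (PDF) is 

$$\mathcal{N}(\nu | \mu, \Lambda^{-1}) = \frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} (\nu - \mu)^T \Lambda (\nu - \mu) \right\}. \tag{4}$$

Thus, let $Z = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$ be a tuple of Gaussian 
parameters. 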

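Since the definition of $p(\nu_j^{(i)} | z^{(i)})$ is free of $\nu^{(j)}$ for $(j, i) \in \mathcal{E}$, 
we have essentially assumed the independence between $\nu_j^{(i)}$ 
and $\nu^{(j)}$, which suggests $p(\nu^{(i)} | \nu^{(j)}, z^{(i)}) = p(\nu_j^{(i)} | z^{(i)})$. 
Thus, we take our tree-structured shape model as a Bayesian 
network, whose joint probability is 

$$p(\{\nu^{(i)}\}_{i=0}^{m} | Z) = p(\nu^{(0)}) \prod_{(j,i) \in \mathcal{E}} p(\nu^{(i)} | \nu^{(j)}, z^{(i)}) = \prod_{i=1}^{m} p(\nu_j^{(i)} | z^{(i)}), \tag{5}$$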

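where $p(\nu^{(0)}) = 1$. Finally, taking the negative logarithm of 
the joint probability gives the regularization term $g(\nu, Z)$ in 
(3) as 

$$g(\nu, Z) = -\ln p(\{\nu^{(i)}\}_{i=0}^{m} | Z) = \frac{1}{2} \sum_{i=1}^{m} (\nu_j^{(i)} - \mu^{(i)})^T \Lambda^{(i)} (\nu_j^{(i)} - \mu^{(i)}) + b, \tag{6}$$

where $b = \frac{md}{2} \ln(2\pi) - \frac{1}{2} \sum_{i=1}^{m} \ln |\Lambda^{(i)}|$ is the term independent 
of $\{\nu^{(i)}\}_{i=1}^m$. 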
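
To make the regularizer concrete, the quadratic part of (6) can be evaluated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `shape_regularizer`, the `parent` array that encodes the tree, and the array layouts are all illustrative choices, and the constant b is omitted because it does not depend on the deformations.

```python
import numpy as np

def shape_regularizer(nu, parent, mu, Lam):
    """Quadratic part of g(nu, Z) in (6), i.e. g without the constant b.

    nu     : (m + 1, d) array; row 0 is the root node, rows 1..m are the parts.
    parent : length m + 1 sequence; parent[i] is the parent j of node i
             (parent[0] is unused). Hypothetical encoding of the tree edges.
    mu     : (m + 1, d) array of edge means mu^(i) (row 0 unused).
    Lam    : (m + 1, d, d) array of edge precisions Lambda^(i) (entry 0 unused).
    """
    total = 0.0
    for i in range(1, nu.shape[0]):
        # (nu^(i) - nu^(j)) - mu^(i) for the edge (j, i) with j = parent[i]
        diff = nu[i] - nu[parent[i]] - mu[i]
        total += 0.5 * diff @ Lam[i] @ diff
    return total
```

Because every part has exactly one parent, the loop over the m non-root nodes visits each tree edge once, which mirrors the product over edges in (5).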


A similar tree-structured shape model and quadratic regularization 
term were also used in [25] for face detection 
and facial landmark localization. However, they ignored the 
dependency among variables of the difference of transformation 
parameters associated with any pair of facial parts connected 
by an edge in the tree, which, in our case, is equivalent to constraining 
$\Lambda^{(i)}$ to be diagonal. Compared to [25], our shape constraint 
is derived from a Bayesian network formulation, which not 
only interprets the underlying probabilistic properties of tree- 
structured shape models, but also enables maximum a posteriori 
(MAP) estimation of the model parameters Z in the training 
stage of CPA. We will show later that the probabilistic prior 
introduced for MAP estimation prevents the CPA training from 
degenerate solutions when image alignment and tree-model 
learning are jointly formulated in an unsupervised way. This 
scenario is different from [25], as their tree-structured shape 
model is integrated in a classification problem, which is strongly 
supervised. 

In fact, parametrization of AAMs and CLMs [21, 23] is also 
based on a joint Gaussian model, which seems to be similar 
to (6) derived from a tree model. To see the difference, if we 
concatenate the variables $\{\nu^{(i)}\}_{i=1}^m$ as $\mathrm{vec}(\nu)$ and model it as 
a Gaussian distribution, the corresponding precision matrix, 
denoted as $\Lambda \in \mathbb{R}^{dm \times dm}$, will have a block sparse structure 
with at most $d^2 (m + |\{(j, i) \in \mathcal{E} : j \neq 0\}|)$ non-zero entries. 
The assumption on the independence between $\nu_j^{(i)}$ and $\nu^{(j)}$ for 
$(j, i) \in \mathcal{E}$, which constrains $\Lambda$'s degrees of freedom to be $d^2 m$, 
distinguishes the tree-structured shape model from a general 
joint Gaussian model with block diagonal precision. 

Up to now we have assumed the face at the global level has 
been registered into the canonical form. In practice, however, 
observed face images are not generally in this canonical form. 
Consequently, the learned tree-structured shape model for a 
globally canonical face will not apply directly. To mitigate 
this problem, we consider optimizing an additional holistic 
transformation $\sigma$, so that the learned tree-structured shape 
model can still be useful to constrain the optimization of $\nu$. 
The CPA objective (3) is then rewritten as 

$$\min_{x^{(i)},\, e^{(i)},\, \nu,\, \sigma} \; \sum_{i=1}^{m} \lambda^{(i)} \|e^{(i)}\|_1 + \eta\, g(\nu, Z) \quad \text{s.t.} \quad y \circ \sigma \circ \nu^{(i)} = D^{(i)} x^{(i)} + e^{(i)}. \tag{7}$$

We call $\sigma$ the holistic deformation, and keep calling $\nu^{(i)}$ the 
i-th part deformation. 


B. Optimization 

Given a learned tree-structured shape model, we present 
algorithms to solve our CPA objective (7) in this subsection. 
The main difficulty of solving (7) comes from the non-convexity 
of its constraints $y \circ \sigma \circ \nu^{(i)} = D^{(i)} x^{(i)} + e^{(i)}$, i = 1, ..., m, 
which couple the nonlinear operations of the holistic deformation 
$\sigma$ and the part deformations $\{\nu^{(i)}\}_{i=1}^m$ on the image domain. We 
choose the strategy of alternating optimization: we first update 
$\{x^{(i)}\}_{i=1}^m$, $\{e^{(i)}\}_{i=1}^m$, and $\nu$ together while fixing $\sigma$, and we 
then update $\sigma$.² These two steps are alternately applied until the 
algorithm converges. Note that this alternating strategy requires 
relatively good initialization of $\sigma$ and $\nu$, so that the learned 
tree-structured shape model can be effective in constraining the 
optimization of $\nu$. We defer the discussion of this issue to 
Section III-B3 while assuming at this moment that the initial 
$\sigma$ and $\nu$ are good enough to start the alternating process. 

1) Solving Part Deformations: Given fixed $\sigma$, we update 
each $\nu^{(i)}$ by a generalization of the Gauss-Newton method. 
More specifically, for a linear update from $\nu^{(i)}$ to $\nu^{(i)} + \Delta\nu^{(i)}$, 
the left-hand side of the equality constraint in (7) can be 
approximated by its first-order Taylor expansion at $\nu^{(i)}$, i.e., 
$y \circ \sigma \circ (\nu^{(i)} + \Delta\nu^{(i)}) \approx y \circ \sigma \circ \nu^{(i)} + J^{(i)} \Delta\nu^{(i)}$, where 
$J^{(i)} = \partial (y \circ \sigma \circ \nu^{(i)}) / \partial \nu^{(i)}$ is the Jacobian with respect 
to (w.r.t.) $\nu^{(i)}$. Let $\Delta\nu = [\Delta\nu^{(1)}, \Delta\nu^{(2)}, \ldots, \Delta\nu^{(m)}]$. The 
above linearization leads to the following problem to optimize 
$\{x^{(i)}\}_{i=1}^m$, $\{e^{(i)}\}_{i=1}^m$, and $\Delta\nu$: 

$$\min_{x^{(i)},\, e^{(i)},\, \Delta\nu,\; i=1,2,\ldots,m} \; \sum_{i=1}^{m} \lambda^{(i)} \|e^{(i)}\|_1 + \eta\, g(\nu + \Delta\nu, Z) \quad \text{s.t.} \quad y \circ \sigma \circ \nu^{(i)} + J^{(i)} \Delta\nu^{(i)} = D^{(i)} x^{(i)} + e^{(i)}. \tag{8}$$

We repeatedly solve problem (8) to linearly update $\nu$ until 
convergence to a local minimum, which gives the solution to 
the original problem (7). Similar iterative techniques have also 
been used in related works [38, 3], and showed good convergence 
behavior. 
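
The first-order linearization behind (8) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: `warp` is a hypothetical callable that maps deformation parameters to a vectorized warped part, and the Jacobian is estimated by central differences, which is a stand-in choice since the text does not say how $J^{(i)}$ is computed in practice.

```python
import numpy as np

def numerical_jacobian(warp, nu, eps=1e-4):
    """Central-difference estimate of J = d warp(nu) / d nu, shape (n_pixels, d).

    `warp` is a hypothetical function returning the vectorized warped part.
    """
    nu = np.asarray(nu, dtype=float)
    n_pixels = np.asarray(warp(nu), dtype=float).size
    J = np.empty((n_pixels, nu.size))
    for k in range(nu.size):
        step = np.zeros_like(nu)
        step[k] = eps
        plus = np.asarray(warp(nu + step), dtype=float).ravel()
        minus = np.asarray(warp(nu - step), dtype=float).ravel()
        J[:, k] = (plus - minus) / (2.0 * eps)
    return J

def linearized_warp(warp, nu, delta_nu):
    """First-order approximation warp(nu + delta_nu) ~ warp(nu) + J @ delta_nu."""
    base = np.asarray(warp(nu), dtype=float).ravel()
    return base + numerical_jacobian(warp, nu) @ np.asarray(delta_nu, dtype=float)
```

The approximation is only trustworthy for small updates, which is one way to read the requirement that the iteration starts from a reasonably good initialization.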

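We solve the convex problem (8) by adapting the Augmented 
Lagrange Multiplier (ALM) method [39]. Let 

$$h(x^{(i)}, e^{(i)}, \Delta\nu^{(i)}) = y \circ \sigma \circ \nu^{(i)} + J^{(i)} \Delta\nu^{(i)} - D^{(i)} x^{(i)} - e^{(i)}. \tag{9}$$

The augmented Lagrangian function for (8) can be written as 

$$L_{\beta}(\{x^{(i)}\}_{i=1}^{m}, \{e^{(i)}\}_{i=1}^{m}, \Delta\nu, \{\gamma^{(i)}\}_{i=1}^{m}) = \sum_{i=1}^{m} \Big( \lambda^{(i)} \|e^{(i)}\|_1 + \langle \gamma^{(i)}, h(x^{(i)}, e^{(i)}, \Delta\nu^{(i)}) \rangle + \frac{\beta}{2} \|h(x^{(i)}, e^{(i)}, \Delta\nu^{(i)})\|_F^2 \Big) + \eta\, g(\nu + \Delta\nu, Z), \tag{10}$$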

where $\{\gamma^{(i)}\}_{i=1}^m$ are the Lagrange multiplier vectors, $\|\cdot\|_F$ 
denotes the matrix Frobenius norm, and $\langle \cdot, \cdot \rangle$ denotes the inner 
product of vectors or matrices. Instead of directly solving 
the constrained problem (8), ALM searches for a saddle point 
of (10). Given initial $\{\gamma^{(i)}\}_{i=1}^m$, it iteratively and alternately 
updates $\{x^{(i)}\}_{i=1}^m$, $\{e^{(i)}\}_{i=1}^m$, $\Delta\nu$ and $\{\gamma^{(i)}\}_{i=1}^m$ by 

1) $\{x^{(i)}\}_{i=1}^{m}, \{e^{(i)}\}_{i=1}^{m}, \Delta\nu \leftarrow \arg\min_{x^{(i)},\, e^{(i)},\, \Delta\nu,\; i=1,2,\ldots,m} L_{\beta_l}(\{x^{(i)}\}_{i=1}^{m}, \{e^{(i)}\}_{i=1}^{m}, \Delta\nu, \{\gamma^{(i)}\}_{i=1}^{m})$; (11) 

2) $\gamma^{(i)} \leftarrow \gamma^{(i)} + \beta_l\, h(x^{(i)}, e^{(i)}, \Delta\nu^{(i)})$, for i = 1, 2, ..., m; 

where l is the iteration number, and $\{\beta_l\}_{l=1,2,\ldots}$ is an increasing 
sequence. ALM is proven to converge to the optimum of the 
original problem as $\beta_l$ becomes sufficiently large [40]. 

² Instead of updating $\sigma$, $\{x^{(i)}\}_{i=1}^m$ and $\{e^{(i)}\}_{i=1}^m$ while fixing $\nu$ in the 
second step, we propose a more efficient approach that jointly updates $\sigma$ and 
$\nu$ while keeping $\{x^{(i)}\}_{i=1}^m$ and $\{e^{(i)}\}_{i=1}^m$ fixed. Details of the approach will 
be presented in Section III-B2. 

It is still difficult to directly solve (11) w.r.t. all three 
groups of variables: $\{x^{(i)}\}_{i=1}^m$, $\{e^{(i)}\}_{i=1}^m$, and $\Delta\nu$. Instead, we 
update them in an alternating manner. It turns 
out that each subproblem associated with any one group of 
variables has a closed-form solution. More specifically, let 

$$S_{\sigma}(x) = \mathrm{sign}(x) \cdot \max\{|x| - \sigma, 0\} \tag{12}$$

be the soft thresholding operator with a scalar threshold $\sigma$. For i = 1, 2, ..., m, we 
update $x^{(i)}$ and $e^{(i)}$ sequentially by 

$$\begin{aligned} r^{(i)} &\leftarrow y \circ \sigma \circ \nu^{(i)} + (1/\beta_l)\, \gamma^{(i)}, \\ x^{(i)} &\leftarrow (D^{(i)T} D^{(i)})^{-1} D^{(i)T} (r^{(i)} + J^{(i)} \Delta\nu^{(i)} - e^{(i)}), \\ e^{(i)} &\leftarrow S_{\lambda^{(i)}/\beta_l} (r^{(i)} + J^{(i)} \Delta\nu^{(i)} - D^{(i)} x^{(i)}), \end{aligned} \tag{13}$$

where $r^{(i)}$ is an auxiliary variable used for notational convenience 
only. Then, the optimum $\Delta\nu$ can be found by solving 

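$$\frac{\partial L_{\beta_l}(\{x^{(i)}\}_{i=1}^{m}, \{e^{(i)}\}_{i=1}^{m}, \Delta\nu, \{\gamma^{(i)}\}_{i=1}^{m})}{\partial \Delta\nu^{(i)}} = \beta_l (r^{(i)} + J^{(i)} \Delta\nu^{(i)} - D^{(i)} x^{(i)} - e^{(i)})^T J^{(i)} + \eta \frac{\partial g(\nu + \Delta\nu, Z)}{\partial \Delta\nu^{(i)}} = 0 \tag{14}$$

for i = 1, 2, ..., m. Let $\{\epsilon_i\}_{i=1}^m$ denote the standard basis of 
$\mathbb{R}^m$. The m equations together form a system in $\Delta\nu$ with 
the following form: 

$$\sum_{i=1}^{m} G^{(i)} \Delta\nu\, \epsilon_i \epsilon_i^T + \eta \frac{\partial g(\nu + \Delta\nu, Z)}{\partial \Delta\nu^T} = Q, \tag{15}$$

where $G^{(i)} \in \mathbb{R}^{d \times d}$, and $Q \in \mathbb{R}^{d \times m}$. Here, $G^{(i)} = 
\beta_l J^{(i)T} J^{(i)}$, and the i-th column of Q is $\beta_l J^{(i)T} (D^{(i)} x^{(i)} + 
e^{(i)} - r^{(i)})$. As $g(\nu + \Delta\nu, Z)$ has the form of a summation 
of quadratics in $\{\Delta\nu^{(i)}\}_{i=1}^m$, (15) is a linear system. It is 
usually sparse due to the tree structure of the shape model. 
Hence, even when a large number of facial parts are present, we 
can still simultaneously and efficiently solve for $\{\Delta\nu^{(i)}\}_{i=1}^m$ with 
high precision. We describe the expanded form of (15) in 
Appendix A. 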
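
As an illustration of why (15) is linear in $\Delta\nu$, the system can be assembled explicitly for a small tree. The following sketch rests on several assumptions that are not taken from the paper: the function name `solve_delta_nu`, the `parent` list that encodes the tree, the root deformation held at zero, and the use of dense matrices (a practical solver would presumably exploit the sparsity noted above). It is not the expanded form given in Appendix A.

```python
import numpy as np

def solve_delta_nu(G, Q, nu, parent, mu, Lam, eta):
    """Solve the linear system (15) for delta_nu with dense algebra.

    G      : (m, d, d) array, G[i-1] = beta * J_i^T J_i for part i.
    Q      : (d, m) array, right-hand side of (15).
    nu     : (d, m) array of current part deformations (root fixed at zero).
    parent : length-m list; parent[i-1] in {0, 1, ..., m} is the parent of part i.
    mu     : (m, d) array of edge means; Lam : (m, d, d) array of edge precisions.
    """
    m, d = len(parent), nu.shape[0]
    A = np.zeros((d * m, d * m))   # block-diagonal data term built from G
    H = np.zeros((d * m, d * m))   # joint precision induced by the tree model
    c = np.zeros(d * m)
    for i in range(m):
        B = np.zeros((d, d * m))   # B @ vec(nu) = nu_i - nu_parent(i)
        B[:, i * d:(i + 1) * d] = np.eye(d)
        j = parent[i]
        if j != 0:                 # the root node contributes a constant zero
            B[:, (j - 1) * d:j * d] = -np.eye(d)
        H += B.T @ Lam[i] @ B
        c += B.T @ Lam[i] @ mu[i]
        A[i * d:(i + 1) * d, i * d:(i + 1) * d] = G[i]
    v = nu.T.reshape(-1)           # stack the columns nu^(1), ..., nu^(m)
    rhs = Q.T.reshape(-1) - eta * (H @ v - c)
    w = np.linalg.solve(A + eta * H, rhs)
    return w.reshape(m, d).T       # back to a (d, m) matrix
```

In this sketch the matrix `H` has non-zero off-diagonal blocks only for pairs of parts joined by a tree edge, which is where the sparsity of (15) comes from.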

Instead of solving (11) exactly by converged iterations of 
alternating optimization of $\{x^{(i)}\}_{i=1}^m$, $\{e^{(i)}\}_{i=1}^m$ and $\Delta\nu$, we 
use inexact ALM that updates the three groups of variables 
alternately only once in each iteration of the ALM method. 
Compared to exact ALM, inexact ALM shows better practical 
performance in terms of optimization efficiency [39]. In 
summary, we solve (7) by repeatedly solving the linearized 
problem (8), which itself can be efficiently solved by inexact 
ALM. 
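
The inexact ALM updates can be sketched for the simplest case of a single part whose parent is the root node, so that the shape term reduces to one quadratic. This is a hedged illustration rather than the authors' code: the function names, the default values, the geometric growth of the penalty with factor `rho`, and the cap `beta_max` are assumptions made here for a self-contained example.

```python
import numpy as np

def soft_threshold(x, sigma):
    """Soft-thresholding operator S_sigma(x) from (12), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - sigma, 0.0)

def inexact_alm_single_part(b, J, D, nu, mu, Lam, lam=1.0, eta=1.0,
                            beta=1.0, rho=1.5, beta_max=1e6, n_iter=100):
    """Inexact ALM sketch for one part attached to the root node.

    Approximately solves
        min  lam * ||e||_1 + (eta / 2) * (nu + dnu - mu)^T Lam (nu + dnu - mu)
        s.t. b + J @ dnu = D @ x + e,
    where b is the vectorized warped part and J is its Jacobian.
    """
    n, d = J.shape
    x, e = np.zeros(D.shape[1]), np.zeros(n)
    dnu, gamma = np.zeros(d), np.zeros(n)
    D_pinv = np.linalg.pinv(D)   # equals (D^T D)^{-1} D^T for full column rank D
    for _ in range(n_iter):
        r = b + gamma / beta                                  # auxiliary variable
        x = D_pinv @ (r + J @ dnu - e)                        # coefficient update
        e = soft_threshold(r + J @ dnu - D @ x, lam / beta)   # sparse error update
        lhs = beta * J.T @ J + eta * Lam                      # single-part linear system
        rhs = beta * J.T @ (D @ x + e - r) - eta * Lam @ (nu - mu)
        dnu = np.linalg.solve(lhs, rhs)
        gamma = gamma + beta * (b + J @ dnu - D @ x - e)      # multiplier update
        beta = min(rho * beta, beta_max)                      # increasing penalty
    return x, e, dnu, gamma
```

Each pass through the loop updates the three groups of variables once before the multiplier step, which is what distinguishes the inexact variant from solving (11) to convergence in every iteration.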

2) Solving Holistic Deformation : Given fixed zz in (7), 
the most straightforward way to optimize cr, together with 
an( j { s use similar techniques as in 

Section III-B1, i.e., linearizing the constraint in (7) w.r.t. 
cr and using ALM to solve the linearized problem. In this 
paper, we propose an alternative approach that involves the 
variables $\sigma$ and $\nu$ only, without performing the actual image

Algorithm 1 Outer loop for CPA - Default Option
1: Initialize $\nu$ and $\sigma$
2: while NOT converged do
3:     $\nu \leftarrow \arg\min_{\nu} \min_{x^{(i)}, e^{(i)};\ i=1,2,\ldots,m} \sum_{i=1}^{m} \lambda^{(i)} \|e^{(i)}\|_1 + \eta\, g(\nu, Z)$  s.t. the constraint in (7)
4:     $\tau^{(i)} \leftarrow \sigma \circ \nu^{(i)}$
5:     $\nu, \sigma \leftarrow \arg\min_{\nu, \sigma} g(\nu, Z)$  s.t. $\sigma \circ \nu^{(i)} = \tau^{(i)}$
6: end while


deformation, and is thus much more efficient than the
aforementioned standard approach. More specifically, given
fixed $\{x^{(i)}\}_{i=1}^{m}$ and $\{e^{(i)}\}_{i=1}^{m}$ obtained in the previous iteration
of part deformations, the right-hand side of the equality
constraint in (7) is also fixed. If we keep it unchanged in
the update of $\sigma$, the combined deformation $\sigma \circ \nu^{(i)}$
does not change either. Denoting this fixed combined deformation
by $\tau^{(i)} = \sigma \circ \nu^{(i)}$, we propose to update $\sigma$ and $\nu$
jointly by optimizing

    $\min_{\nu, \sigma} g(\nu, Z)$  s.t.  $\sigma \circ \nu^{(i)} = \tau^{(i)}$,    (16)

which does not involve any actual image deformation and can
thus be efficiently solved. For the similarity transformation considered
in this paper, we present in Appendix B the associated
parametrization of (16) and the optimization algorithm.

Algorithm 1 gives the outer loop of the algorithms presented 
in Sections III-B1 and III-B2 for the proposed constrained 
part-based alignment. 

3) Initializing Holistic and Part Deformations: Algorithms
in Sections III-B1 and III-B2 require a relatively good initialization
of $\sigma$ and $\nu$. In practice, we rely on off-the-shelf face
detectors [41, 25], which provide a rough bounding box of the
face or the holistic pose and locations of facial parts [25]. We
initialize $\sigma$ using the bounding box provided by face detectors.
If locations of facial parts are available, we also use them
to initialize $\nu$. Otherwise, we choose the initial $\nu$ as the part
deformations that maximize the likelihood of the tree-structured
shape model. This initialization acts as the average template of
the shape constraint, which is independent of any probe image.

The initialization based on face detectors is generally not
accurate enough for face recognition. Fortunately, in most cases
it is good enough as a starting point for automatic alignment. Fig. 3 presents
examples that illustrate the converged solutions of our CPA
algorithm.

In some cases where face detectors give worse outputs, our
CPA algorithm might take longer to converge or might
converge to an undesirable solution. To handle this situation,
we use the holistic alignment method of Wagner et al. [3] to
initialize the holistic transformation $\sigma$. Their method cannot
align non-frontal faces well due to its holistic planar
surface assumption; however, it is good enough to be used as
our initialization.

IV. CPA based Face Recognition 

In this section, we propose a CPA based face recognition
method. Suppose there are $N$ subjects in the gallery set, and
each of them has multiple face images. Suppose these face
images have been aligned at the part level to form the part
dictionaries, which are denoted as $D^{(i)} = [D_1^{(i)}, \ldots, D_N^{(i)}]$ for
$i = 1, \ldots, m$. Given a probe face $y$, CPA suggests aligning
facial parts of $y$ to $\{D^{(i)}\}_{i=1}^{m}$ by solving (7) for part-based
face recognition. However, similar to the situation of holistic
face alignment in [3], the presence of facial parts of multiple
subjects in $\{D^{(i)}\}_{i=1}^{m}$ makes (7) have many local minima,
corresponding to aligning $y$ to the facial parts of different
subjects. Instead, one can perform CPA in a subject-wise
manner by optimizing

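    $\min_{x_s^{(i)}, e_s^{(i)}, \nu_s, \sigma_s;\ i=1,2,\ldots,m} \sum_{i=1}^{m} \lambda^{(i)} \|e_s^{(i)}\|_1 + \eta\, g(\nu_s, Z)$

    s.t.  $y \circ \sigma_s \circ \nu_s^{(i)} = D_s^{(i)} x_s^{(i)} + e_s^{(i)}$,    (17)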

where $\nu_s = [\nu_s^{(1)}, \ldots, \nu_s^{(m)}]$, $\sigma_s$, and $\{e_s^{(i)}\}_{i=1}^{m}$ are variables
of part deformations, holistic deformation, and the alignment
residuals w.r.t. the $s$-th subject, respectively. After solving (17)
for all the $N$ subjects, for every $i$-th part we sort the subject-wise
alignment residuals $\{\|e_s^{(i)}\|_1\}_{s=1}^{N}$ and select the top $C$
subjects with the smallest alignment residuals. Part dictionaries
of these $C$ selected subjects are then put together to form a
pruned gallery of facial parts. To make the facial parts in the
pruned gallery all aligned to the $i$-th part of $y$, we transform
$D_s^{(i)}$ by $(\sigma_s \circ \nu_s^{(i)})^{-1}$ instead of transforming $y$ by $\sigma_s \circ \nu_s^{(i)}$ for
each selected subject $s$. For part-based face recognition, many
existing methods such as SRC [26], LBP [42] and LDA [43]
can be used at the part level, based on the pruned gallery. Final
recognition can be performed by aggregating the part-level
decisions using the basic plurality voting scheme.
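The pruning and voting steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `residuals[i, s]` is taken to hold the $\ell_1$ alignment residual of subject $s$ for part $i$, and the function names `prune_subjects`, `union_of_selected`, and `plurality_vote` are hypothetical. The part-level classifier itself (for example SRC) is left out.

```python
import numpy as np
from collections import Counter

def prune_subjects(residuals, C):
    """For each part (row), return the indices of the C subjects
    with the smallest alignment residuals, in ascending order."""
    order = np.argsort(residuals, axis=1)
    return order[:, :C]

def union_of_selected(selected):
    """Merge the per-part selections into one shared subject set."""
    return sorted(set(selected.ravel().tolist()))

def plurality_vote(part_labels):
    """Return the label predicted by the largest number of parts.
    Ties are resolved in favour of the label seen first."""
    return Counter(part_labels).most_common(1)[0][0]

# Toy example with 2 parts and 3 subjects.
residuals = np.array([[0.9, 0.1, 0.5],
                      [0.2, 0.8, 0.3]])
selected = prune_subjects(residuals, C=2)
shared = union_of_selected(selected)
identity = plurality_vote([2, 1, 2, 0])
```

Sharing one merged subject set across parts trades a slightly larger pruned gallery for a better chance that the true subject survives the pruning for every part.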

To obtain better performance, we may adopt advanced
aggregating schemes, such as the kernelized plurality voting
[29], and joint recognition methods for multiple observations,
such as the joint dynamic sparse representation [30]. Nonetheless,
we consider only the basic choice to keep this paper
concise.

Algorithm 2 gives a summary of our proposed CPA based
face recognition method. An illustrative procedure with 6
sequential modules is also shown in Fig. 4. Note that there
are a few technical details in Algorithm 2 that can make a
difference to recognition performance. In particular, as the
selected subjects are likely to differ across parts,
we combine them into a single set that replaces the selected subject set
of each part (Line 13), so that the ground-truth subject has a
higher chance of being included for part recognition. Further,
to control the overall number of selected subjects, we set $C$
to the smallest integer that makes the pruned gallery size no less
than a given parameter $P$ (Line 13). We also adjust the $i$-th part
of $y$ by the averaged transform of the $i$-th part dictionaries of
the selected subjects before aligning them to it (Lines 15 and 16).

V. Model Learning for CPA 

Algorithms in the preceding sections assume that the CPA
model, i.e., the tree-structured shape model and its associated
part dictionaries, has been given. In this section, we present
how to learn them from a training gallery of face images. Note
that each part dictionary is composed of a set of registered

Figure 4: The framework of the recognition algorithm in accordance with Algorithm 2.


Algorithm 2 CPA based face recognition

Require: Subject-wise part dictionaries $\{D_s^{(i)}\}$ for
    the $m$ parts and $N$ subjects; the transformation group $G$;
    the parameter tuple $Z$ of the tree-structured shape model; the
    probe image $y$; and the chosen subject number $P$
Ensure: identity($y$)
    > Subject-wise alignment
 1: for each subject $s$ do
 2:     Align $y$ by CPA using $\{D_s^{(i)}\}_{i=1}^{m}$,
        and get $\{e_s^{(i)}\}_{i=1}^{m}$, $\nu_s$, $\sigma_s$
 3:     for each part $i$ do
 4:         $\tau_s^{(i)} \leftarrow \sigma_s \circ \nu_s^{(i)}$
 5:     end for
 6: end for
    > Pruning subjects
 7: for each part $i$ do
 8:     Sort subjects in ascending order according to $\|e_s^{(i)}\|_1$,
        and get the orders $s_1^{(i)}, \ldots, s_N^{(i)}$
 9: end for
10: for $j = 1, 2, \ldots, N$ do
        ...
    end for
    > Part-wise recognition
14: for $i = 1, 2, \ldots, m$ do
15:     $\bar{\tau}^{(i)} \leftarrow \mathrm{mean}\{\tau_s^{(i)}\}_{s \in \mathcal{S}_C}$
16:     $y^{(i)} \leftarrow y \circ \bar{\tau}^{(i)}$
17:     for each $s \in \mathcal{S}_C$ do
18:         $\tau_s^{(i)} \leftarrow \tau_s^{(i)} \circ (\bar{\tau}^{(i)})^{-1}$
19:         $D_s^{(i)} \leftarrow D_s^{(i)} \circ (\tau_s^{(i)})^{-1}$
20:     end for
21:     Use an existing recognition method to recognize $y^{(i)}$
        with the pruned gallery $\{D_s^{(i)}\}_{s \in \mathcal{S}_C}$,
        and get the predicted label $\mathrm{id}^{(i)}$
22: end for
    > Decision by voting
23: identity($y$) $\leftarrow$ the label that occurs most often in $\{\mathrm{id}^{(i)}\}_{i=1}^{m}$


facial parts of gallery images. For some facial parts, e.g., 
those around the chin position, it is even difficult to manually 
annotate facial landmarks in order to align them with high 
precision. To realize the automatic alignment of facial parts 
to form the part dictionaries, we propose an approach that 
couples the learning of a tree-structured shape model and that 
of part dictionaries. Indeed, the relational constraints among
deformations of different parts provided by the tree-structured
shape model stabilize the alignment of part dictionaries,
and the deformation parameters of facial parts of the training
images in turn give statistical evidence to learn the parameters $Z$
in the CPA model. Learning them in a coupled way is thus a
natural choice. In the following, before presenting algorithmic 
details of this coupled CPA model learning, we first present 
how to align and form the part dictionaries given the constraint 
from a known tree-structured shape model, which will serve as 
one of the alternating steps in the coupled CPA model learning 
algorithm. 

A. Learning Part Dictionaries with Constraint from Known 
Tree-structured Shape Model 

We learn the part dictionaries from a training gallery of
$n$ face images, denoted as $D = [d_1, \ldots, d_n]$. Every $k$-th
column $d_k$ in $D$ represents a stacked vector form of a face
image in the gallery. Learning part dictionaries is concerned
with optimizing the $d$-dimensional $\{\sigma_k \in G\}_{k=1}^{n}$ for holistic
deformations and $\{\nu_k^{(i)} \in G\}_{k=1}^{n}$, $i = 1, \ldots, m$, for part
deformations, so that after part-based alignment every $i$-th
part dictionary, denoted as $D^{(i)} = [d_1^{(i)}, d_2^{(i)}, \ldots, d_n^{(i)}]$ with
$d_k^{(i)} = d_k \circ \sigma_k \circ \nu_k^{(i)}$, contains facial parts that have been
registered into some canonical form, as shown in Fig. 5b.

We observe that the registered facial parts in each part
dictionary correspond to the appearance of different subjects at the
same facial region. These registered facial parts will ideally
resemble each other if nuisances due to inter-subject, illumination³,
expression, and/or pose variations can be decomposed
out. In other words, the part dictionary would be low-rank
after decomposing out the aforementioned nuisances, which
can be modeled as a sparse error matrix. This sparse and
low-rank decomposition was used by Peng et al. [38] to align
a batch of linearly correlated images, such as frames in a
video sequence or face images of the same subject. Motivated
by [38], in this paper, we leverage the very similar low-rank
(matrices of part dictionaries without nuisances) and sparsity
(error matrices modeling various variations) properties to align
and form the part dictionaries. However, if we directly apply
the techniques of Peng et al. [38] to perform independent part-based
alignment, the alignment process often converges to
less meaningful solutions, as shown in Fig. 5a. The reason
is similar to that causing the failure of the method of
Wagner et al. [3] when aligning parts of a probe face
to the gallery set, as we explained in Section III. Instead, we
constrain part-based alignment for learning part dictionaries
by the tree-structured shape model.

3 If a sufficient number of illumination conditions exist, illumination variations
may not be classified as a nuisance in that they can be linearly modeled for
low-rankness.

Figure 5: Part dictionary learning, shown for the left eye, the left ear,
and the right corner of the mouth. (a) Independent learning by the method of
Peng et al. [38] at 20 iterations. (b) Learning with a known shape constraint
at the converged stage. Parts are initialized at dashed
red boxes, and aligned in batch. The solid green boxes denote
the obtained domains of the parts. D is a learned part dictionary
(images cropped from the obtained domains), and A + E is its
best low-rank and sparse decomposition, where A is the low-rank
component, and E is the sparse error. The cropped images
are normalized in terms of intensity for display convenience.

To simplify the notations, we write $\sigma = [\sigma_1, \ldots, \sigma_n] \in \mathbb{R}^{d \times n}$,
and combine $\{\nu_k^{(i)}\}$, $i = 1, \ldots, m$, $k = 1, \ldots, n$, into a third-order tensor
$\nu \in \mathbb{R}^{d \times m \times n}$. We define two operators that apply on $\nu$ as

    $T^{(i)}(\nu) = [\nu_1^{(i)}, \ldots, \nu_n^{(i)}]$,    $T_k(\nu) = [\nu_k^{(1)}, \ldots, \nu_k^{(m)}]$,

where $T^{(i)}(\cdot)$ extracts deformation parameters of the $i$-th part of
the $n$ face images, and $T_k(\cdot)$ extracts those of the $m$ facial parts
of the $k$-th image. We then write $D^{(i)} = D \circ \sigma \circ T^{(i)}(\nu)$, where
“$\circ$” applies column-wise. With these definitions our objective
for learning part dictionaries can be written as

    $\min_{\nu, \sigma, A^{(i)}, E^{(i)};\ i=1,2,\ldots,m} \sum_{i=1}^{m} (\|A^{(i)}\|_* + \lambda^{(i)} \|E^{(i)}\|_1) + \eta \sum_{k=1}^{n} g(T_k(\nu), Z)$

    s.t.  $D \circ \sigma \circ T^{(i)}(\nu) = A^{(i)} + E^{(i)}$,    (18)

where $\|\cdot\|_*$ is the nuclear norm, which is a convex surrogate
of the matrix rank. As shown by Peng et al. [38] and Candès
et al. [44], the penalty parameters $\{\lambda^{(i)}\}_{i=1}^{m}$ can be set as the
reciprocal of the square root of the number of rows of $E^{(i)}$.
In this paper, we set $\lambda^{(i)}$ to this reciprocal scaled by a constant
$\bar{\lambda} = 1$. We accordingly set $\eta$ proportional to a constant $\bar{\eta}$, which is
chosen as 0.02. To solve (18), we use a similar alternating
strategy as in Section III-B. Appendix C gives details of the
algorithmic procedure. It requires the initialization of $\nu$, which
can be obtained using the same method as in Section III-B3.
Fig. 5b shows example images of learned part dictionaries by
solving (18). Compared with independent part-based alignment
(example results shown in Fig. 5a), the proposed alignment
method with the constraint from a tree-structured model gives
much more meaningful results.

B. Learning the Tree-structured Shape Model Jointly with Part 
Dictionaries 

To jointly learn the tree-structured shape model and part
dictionaries, it is natural to take $Z$ in (18) as an additional
variable to optimize. To solve this revised problem, one
can alternately update part dictionaries using the algorithms
in Appendix C, which involve the variables $\{A^{(i)}\}_{i=1}^{m}$,
$\{E^{(i)}\}_{i=1}^{m}$, $\nu$, and $\sigma$, and estimate $Z$ by maximum likelihood
(ML) estimation, which fits the parameters $\{z^{(i)} = (\mu^{(i)}, \Lambda^{(i)})\}_{i=1}^{m}$
of Gaussian distributions to the updated $\{T_k(\nu)\}_{k=1}^{n}$. Unfortunately,
this approach empirically gives degenerate
solutions corresponding to a “non-elastic” shape model, where
the Gaussian distributions modeling the deformation differences
of connected nodes in the tree have variances of almost zero
magnitude. Fig. 6a illustrates this phenomenon, where we
measure the determinant and nuclear norm⁴ of the covariance
matrices $\{(\Lambda^{(i)})^{-1}\}_{i=1}^{m}$, which are iteratively updated in an
alternating optimization process.
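The two diagnostics can be computed directly from the eigenvalues of a covariance matrix, as in the small NumPy sketch below. The helper name `variance_strength` and the two example matrices are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def variance_strength(cov):
    """Return (determinant, nuclear norm) of a symmetric PSD covariance.

    The eigenvalues are the variances along the principal directions,
    so the determinant is their product and the nuclear norm their sum.
    """
    eigvals = np.linalg.eigvalsh(cov)
    return float(np.prod(eigvals)), float(np.sum(eigvals))

# An "elastic" edge distribution versus a nearly collapsed one.
elastic = np.diag([4.0, 0.25])
collapsed = np.diag([1e-6, 1e-6])
det_e, nuc_e = variance_strength(elastic)
det_c, nuc_c = variance_strength(collapsed)
```

Both quantities shrink towards zero when the fitted Gaussians collapse, which is the degenerate behaviour described above.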

4 For a Gaussian distribution, the determinant of its covariance matrix is the
product of the variances in its principal directions, and the nuclear
norm is the sum of the variances in those directions. Both indicate
the overall strength of its variances.

Figure 6: Strength of the variances of the Gaussian distributions for the differences of part transformations at different iterations
of CPA learning. (a) CPA learning with no prior for the tree-structured shape model. (b) CPA learning with a Gaussian-Wishart prior
for the tree-structured shape model. For a Gaussian distribution, the determinant (left-side figures) and nuclear norm (right-side figures) of its
covariance matrix indicate the overall strength of its variances. Larger values indicate better flexibility of the model, and vice
versa. In this illustration, we use 60 images from the MUCT dataset [33] to learn the CPA model defined in Fig. 1. Both the
determinant and nuclear norm are averaged over all the Gaussian distributions attached to the tree edges.


To remedy this problem, we consider imposing a prior
on $Z$ to regularize the learning of the tree-structured shape
model. Denote $h(Z, \Phi) = -\ln p(Z \mid \Phi)$ with $\Phi$ as the hyper-parameters.
Incorporating $h(Z, \Phi)$ into (18) results in the
following new objective for joint learning of $Z$ and part
dictionaries

    $\min_{\nu, \sigma, Z, A^{(i)}, E^{(i)};\ i=1,2,\ldots,m} \sum_{i=1}^{m} (\|A^{(i)}\|_* + \lambda^{(i)} \|E^{(i)}\|_1) + \eta \left( \sum_{k=1}^{n} g(T_k(\nu), Z) + h(Z, \Phi) \right)$

    s.t.  $D \circ \sigma \circ T^{(i)}(\nu) = A^{(i)} + E^{(i)}$.    (19)

Let $\tilde{\nu}_k^{(i)}$ denote the difference between $\nu_k^{(i)}$ and the deformation
of the parent of the $i$-th part in the tree, for the $k$-th image. Based on (5) we have

    $\sum_{k=1}^{n} g(T_k(\nu), Z) = -\sum_{i=1}^{m} \ln p(\{\tilde{\nu}_k^{(i)}\}_{k=1}^{n} \mid z^{(i)})$,    (20)

where $z^{(i)} = (\mu^{(i)}, \Lambda^{(i)})$, and $\tilde{\nu}_k^{(i)}$ is drawn from the Gaussian
distribution with mean $\mu^{(i)}$ and precision $\Lambda^{(i)}$. We impose a prior
on $\{\mu^{(i)}, \Lambda^{(i)}\}$ using the Gaussian-Wishart distribution, which is
the conjugate prior of the Gaussian distribution. Denote $\phi^{(i)}$
as the parameters of the Gaussian-Wishart prior for $z^{(i)}$. We have

    $h(Z, \Phi) = -\sum_{i=1}^{m} \ln p(z^{(i)} \mid \phi^{(i)})$,    (21)

where $\Phi = (\phi^{(1)}, \ldots, \phi^{(m)})$ is the tuple of the hyper-parameters.

Given $\Phi$, we solve (19) by alternating between the following
two steps:

1) Update $Z$ by solving $\min_Z \sum_{k=1}^{n} g(T_k(\nu), Z) + h(Z, \Phi)$;

2) Update $\nu$, $\sigma$ by solving (18).

For the first step, note that by Bayes' rule,

    $\sum_{k=1}^{n} g(T_k(\nu), Z) + h(Z, \Phi)$    (22)

    $= -\sum_{i=1}^{m} \left\{ \ln p(z^{(i)} \mid \{\tilde{\nu}_k^{(i)}\}_{k=1}^{n}, \phi^{(i)}) + \ln p(\{\tilde{\nu}_k^{(i)}\}_{k=1}^{n}) \right\}$,

where $p(\{\tilde{\nu}_k^{(i)}\}_{k=1}^{n})$ does not change with $z^{(i)}$, and the terms
of the summation share no variables. Thus the
first step can be carried out by independently solving the $m$ sub-problems

    $\max_{z^{(i)}} p(z^{(i)} \mid \{\tilde{\nu}_k^{(i)}\}_{k=1}^{n}, \phi^{(i)})$,    (23)

for $i = 1, \ldots, m$. It is a standard maximum a posteriori (MAP)
inference for the Gaussian distribution, where the posterior

Algorithm 3 CPA model learning
Require: Training images $D$, initial $\nu$, $\sigma$, and prior weight $\beta$
Ensure: learned $\nu$, $\sigma$, $Z$
1: $Z \leftarrow \arg\min_Z \sum_{k=1}^{n} g(T_k(\nu), Z)$
2: Set $\Phi$ in consistency with $Z$ and $\beta$
3: while not converged do
4:     Update $\nu$, $\sigma$ by solving (18).
5:     $Z \leftarrow \arg\min_Z \sum_{k=1}^{n} g(T_k(\nu), Z) + h(Z, \Phi)$
6: end while


still follows a Gaussian-Wishart distribution, whose parameters 
are different from those of the prior. The maximum point of a 
Gaussian-Wishart PDF can be derived in closed form in terms 
of its parameters. Please refer to Appendix D for details of 
the Gaussian-Wishart prior and the inference problems relevant
to it. The second step is the problem of learning
part dictionaries given a fixed $Z$, which is addressed in the
previous subsection.
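For reference, the MAP step for one tree edge can be sketched with the textbook Gaussian-Wishart (Normal-Wishart) conjugate update. The parametrization below, with prior mean `mu0`, mean-confidence `kappa0`, scale matrix `W0`, and degrees of freedom `nu0`, is the standard one and is an assumption here; the paper's Appendix D may use a different but equivalent form. The function name `gaussian_wishart_map` is hypothetical.

```python
import numpy as np

def gaussian_wishart_map(X, mu0, kappa0, W0, nu0):
    """MAP estimate (mean, precision) of a Gaussian under a
    Gaussian-Wishart prior, given samples X of shape (n, d).

    Posterior parameters follow the standard conjugate update; the joint
    mode is mu = mu_n and Lambda = (nu_n - d) * W_n (needs nu_n > d).
    """
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)            # scatter matrix
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0)[:, None]
    W_n_inv = np.linalg.inv(W0) + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    W_n = np.linalg.inv(W_n_inv)
    return mu_n, (nu_n - d) * W_n
```

Because the prior scale matrix enters the posterior additively, the estimated precision stays bounded even when the observed deformation differences are nearly identical, which is the regularizing effect sought here.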

The remaining question is how to set the hyper-parameters
$\{\phi^{(i)}\}_{i=1}^{m}$ properly. Since the initialization of $\nu$ gives rough
evidence on what the tree-structured shape model should be, we
set $\phi^{(i)}$ to be consistent with the ML estimation of $z^{(i)}$ with
the initial $\nu$ for $i = 1, 2, \ldots, m$. Thus, the trained CPA model
will not deviate much from its rough estimation. Meanwhile,
we also set $\phi^{(i)}$ properly to control the prior's weight, i.e.,
to make the prior contribute to the MAP estimation as much
as $\beta n$ training samples, where $\beta > 0$. Please refer to
Appendix D for how to set $\beta$ with explicit physical meaning.
The above learning procedure for a CPA model is summarized
in Algorithm 3.

The above algorithms for CPA model learning require a
careful initialization of $\sigma$ and $\nu$, which can be done either
by detectors [41, 25] or by manual annotations. In particular,
for every $k$-th training image, we localize the two eye corners
to determine the similarity transform $\sigma_k$. We then localize
the facial landmarks at the centers of the facial parts⁵, and
initialize $\nu_k^{(i)}$ so that the $i$-th part has the pre-defined
orientation and size relative to the canonical form of the entire
face that is determined by $\sigma_k$. Although this initialization
scheme is easy to carry out, it makes the part transformation
differences between linked parts identical for all the $n$ training
images. Consequently, we cannot get statistical evidence to
learn $Z$. To overcome this difficulty, we propose a heuristic
initialization scheme for learning the CPA model. In particular,
we start with the part dictionary learning (18) by setting $\eta = 0$,
i.e., without any shape constraint, and run it for only a
few (here, 5) iterations of the generalized Gauss-Newton method,
which repeatedly solves the linearized problem (32) (in
Appendix C). The updated part transformations are used for
the actual initialization of the CPA model learning.

Fig. 6b shows the effect of our proposed objective (19) for
joint learning of the CPA model, where we investigate the
covariance matrices $\{(\Lambda^{(i)})^{-1}\}_{i=1}^{m}$ of the Gaussian distributions
obtained in each iteration of our algorithm. Compared with
Fig. 6a, the stability of their determinant and nuclear norm
values suggests that a better tree-structured shape model is
obtained.

5 We always design the parts to center at the facial landmarks that are easy
to annotate.

Up to now we have assumed that the configuration of the tree,
i.e., how the nodes (facial parts) are connected to form the tree,
is given. To learn the configuration of a tree, Zhu and Ramanan
[25] used the Chow-Liu algorithm [45] for the application
of facial landmark localization. The Chow-Liu algorithm finds the
configuration of a tree-structured Bayesian network that best approximates
the joint distribution of all the variables. However, for a node
with children, our tree-structured shape model (as well as the
shape model of Zhu and Ramanan [25]) assumes independence
between its part transformation and its part transformation
differences with its children (refer to Section III-A), which is
not an assumption of the Chow-Liu algorithm.

Instead, we take the tree configuration that minimizes the
objective function for learning the CPA model, which can be
reduced to $\sum_{k=1}^{n} g(T_k(\nu), Z) + h(Z, \Phi)$ as in (22). According
to (20) and (21), we can decompose it into a summation
of part-wise scores, where each term is associated only with the
edge linking one node and its parent. In view of this, we first
link the $(m + 1)$ nodes (recall the node “0”) into a complete
directed graph, compute the score associated with each of the
$(m + 1)m$ edges, and take the minimum spanning tree rooted
at “0” to be the optimal configuration of the tree.
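The selection of a rooted tree from edge scores can be illustrated with an exhaustive search that is only practical for a handful of parts. This is a didactic sketch under stated assumptions, not the procedure used in the paper: `score[p, i]` is taken to be the cost of making node `p` the parent of node `i`, the function name `best_rooted_tree` is hypothetical, and a real implementation would use a minimum spanning arborescence algorithm such as Chu-Liu/Edmonds instead of enumeration.

```python
import itertools
import numpy as np

def best_rooted_tree(score):
    """Exhaustively find the cheapest tree rooted at node 0.

    score: (m+1, m+1) array, score[p, i] = cost of the edge p -> i.
    Returns (parents, cost), where parents[i-1] is the parent of node i.
    """
    m = score.shape[0] - 1
    best, best_cost = None, np.inf
    for parents in itertools.product(range(m + 1), repeat=m):
        # Every node must reach the root by following parent links;
        # revisiting a node means the assignment contains a cycle.
        valid = True
        for i in range(1, m + 1):
            seen, j = set(), i
            while j != 0:
                if j in seen:
                    valid = False
                    break
                seen.add(j)
                j = parents[j - 1]
            if not valid:
                break
        if not valid:
            continue
        cost = float(sum(score[p, i + 1] for i, p in enumerate(parents)))
        if cost < best_cost:
            best, best_cost = list(parents), cost
    return best, best_cost

# Toy example with root 0 and two parts.
score = np.array([[0.0, 5.0, 1.0],
                  [0.0, 0.0, 9.0],
                  [0.0, 1.0, 0.0]])
parents, cost = best_rooted_tree(score)
```

In the toy example, attaching part 2 to the root and part 1 to part 2 is cheaper than attaching both parts to the root directly.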

C. Learning Mixture of CPA Models 

The presented CPA model can be extended to a mixture
of CPA models (mCPA) to cope with a larger range of pose
and/or expression variations. Each component of the mCPA
is parameterized in the same way as a standard CPA model,
but it may have a different number of facial parts, since some
facial parts may be occluded under pose changes.

For an mCPA model with $c$ components, $\mathcal{I}_\iota \subseteq \{1, 2, \ldots, m\}$
denotes the index set of available parts for the $\iota$-th component.
Correspondingly, we also use $\{D_\iota^{(i)}\}_{i \in \mathcal{I}_\iota}$ to denote the part
dictionaries and $Z_\iota = \{z_\iota^{(1)}, z_\iota^{(2)}, \ldots, z_\iota^{(m)}\}$ to denote the
parameters of the tree-structured shape model. If the $\iota$-th component
does not have the $i$-th part, i.e., $i \notin \mathcal{I}_\iota$, then $D_\iota^{(i)}$ and $z_\iota^{(i)}$ are
treated as void notations and are not used.

Learning an mCPA model is based on the formulation (19)
for the standard CPA model learning. The difference is that we
learn the models of the $c$ components together, so
that the low-rank and sparse properties of corresponding part
dictionaries belonging to different components can be leveraged jointly,
resulting in consistent part dictionaries across the $c$
components. Given $n$ training face images, we write $D_\iota$ for the
$n_\iota$ face images assigned to the $\iota$-th component, $\sigma_\iota \in \mathbb{R}^{d \times n_\iota}$
for their holistic deformations, and $\nu_\iota \in \mathbb{R}^{d \times m \times n_\iota}$ for their
part deformations. Our objective for learning the mCPA model
can be written as

    $\min \sum_{i=1}^{m} (\|A^{(i)}\|_* + \lambda^{(i)} \|E^{(i)}\|_1) + \eta \sum_{\iota=1}^{c} \left( \sum_{k=1}^{n_\iota} g(T_k(\nu_\iota), Z_\iota) + h(Z_\iota, \Phi_\iota) \right)$    (24)

    s.t.  $D_\iota \circ \sigma_\iota \circ T^{(i)}(\nu_\iota) = A_\iota^{(i)} + E_\iota^{(i)}$ for $i \in \mathcal{I}_\iota$,
          $A_\iota^{(i)}$ and $E_\iota^{(i)}$ are empty matrices for $i \notin \mathcal{I}_\iota$,
          $A^{(i)} = [A_1^{(i)}, A_2^{(i)}, \ldots, A_c^{(i)}]$,
          $E^{(i)} = [E_1^{(i)}, E_2^{(i)}, \ldots, E_c^{(i)}]$.

The above objective can be solved similarly as for (19).
Namely, we alternately update the part dictionaries and the
tree-structured shape models. After solving (24), we have the
part dictionaries $D_\iota^{(i)} = D_\iota \circ \sigma_\iota \circ T^{(i)}(\nu_\iota)$ and the shape model
parameters $\{Z_\iota\}_{\iota=1}^{c}$.

Given a gallery set, ideally an mCPA model should be
learned from face images of this gallery. However, in some
cases the gallery does not contain the non-frontal or non-neutral face
images needed to learn the corresponding component models of the
mCPA. To remedy this problem, we first learn an mCPA model
using a separate training set that contains face images of all
the poses and expressions of interest for a few subjects, and
then learn the part dictionaries on the gallery set of interest
while fixing the tree-structured shape model learned from that
separate training set.

With a learned mCPA model, we use the same method as that
of the standard CPA model to recognize a probe face, which
requires us to select the correct component corresponding to
its pose and expression beforehand. In a fully automatic
face recognition system, we may estimate the pose and
expression by off-the-shelf methods, which we will discuss in
Section VI-D.

VI. Experiments 

In this section, we present experiments to evaluate the 
proposed CPA method in the context of face recognition 
across illuminations, poses, and expressions. We used images 
of frontal face with neutral expression as the gallery, and face 
images with illumination, pose, and expression variations as 
the probes. We used the CMU Multi-PIE [32] and MUCT 
[33] datasets to conduct our experiments. The CMU Multi-PIE 
dataset contains face images with well controlled illumination, 
pose, and expression variations, and is thus intensively used 
for controlled experiments. 

We designed an mCPA model whose component for frontal 
view and neutral expression consists of 21 facial parts of 
varying sizes. The basic constellation of the 21 parts is listed 
in Table I. For the other components, the availability of a part 
is determined by its visibility. The parameters used for the
mCPA learning are set as $\bar{\lambda} = 1$, $\bar{\eta} = 0.02$, and the prior weight $\beta = 0.25$ for
all the experiments reported in this section.

With learned mCPA models, we first evaluated our method 
in Section VI-A in the scenario of face recognition across pose 
and expression with illumination variation. In particular, we 
show the advantages of our CPA method over other alternatives, 
such as the methods of a holistic face alignment followed by 
a holistic or part-based face recognition. We then demonstrate 
in Section VI-B the effectiveness of our CPA method when 
using part-based face recognition strategy. In Section VI-C, we 
test on face images with synthesized occlusions to demonstrate 


Table I: Canonical part locations and sizes for the mCPA
component for frontal view and neutral expression in the
experiments.

No.  Center location          Relative w  Relative h  Absolute w  Absolute h
 1   Center of R-eyebrow      0.4         0.2         24          16
 2   Center of L-eyebrow      0.4         0.2         24          16
 3   Outer corner of R-eye    0.53        0.4         32          32
 4   Center of R-eye          0.4         0.2         24          16
 5   Inner corner of R-eye    0.58        0.44        35          35
 6   Inner corner of L-eye    0.58        0.44        35          35
 7   Center of L-eye          0.4         0.2         24          16
 8   Outer corner of L-eye    0.53        0.4         32          32
 9   R-wing of nose           0.27        0.4         16          32
10   L-wing of nose           0.27        0.4         16          32
11   Apex nasi                0.53        0.28        32          22
12   Philtrum                 1.07        0.44        64          35
13   R-corner of mouth        0.32        0.24        19          19
14   L-corner of mouth        0.32        0.24        19          19
15   Mouth center             0.67        0.28        40          22
16   Center of underlip       0.53        0.2         32          16
17   Bottom of jaw            0.53        0.28        32          22
18   R-ear                    0.4         0.4         24          32
19   L-ear                    0.4         0.4         24          32
20   R-cheek                  0.53        0.4         32          32
21   L-cheek                  0.53        0.4         32          32

Remark: The relative sizes are in terms of the conventional facial
region aligned with the two eyes, i.e., a 60x80 window with the two outer
eye corners at (5,22) and (56,22); the absolute sizes are measured
in pixels for the part dictionaries.

the robustness of our method. Finally, we compared with the 
state-of-the-art across-pose face recognition methods in Section 
VI-D. 

For controlled experiments reported in Sections VI-A, VI-B, 
and VI-C, we initialized face locations by manually annotating 
eye corner points, and assumed that the pose and expression of 
each probe face are given. For practical experiments in Section
VI-D that conduct fully automatic face recognition across pose,
we used an off-the-shelf face detector [41] and pose estimator
[25] to initialize our method.

A. Face Recognition Across Pose and Expression with Different 
Illumination 

Our choice of off-the-shelf methods for face recognition 
across illumination is based on SRC [26], for which we used 
multiple gallery images of varying illuminations for each 
subject. Under this setting, we compare CPA with the following 
three baseline alternatives. 

1) The first one manually aligns probe face images using 
labeled eye-corner points. After manual alignment, face 
recognition is conducted in a holistic manner. This 
alternative method is termed as “manual+holistic”. 

2) The second one automatically aligns probe face images
using the holistic alignment algorithm of Wagner et al.
[3]. After alignment, face recognition is again conducted
in a holistic manner. This alternative is termed
“holistic+holistic”. When SRC [26] is used as the classifier, this
alternative is essentially the same as in [3], which is
similar to our CPA method in that the gallery is
pruned to a subset before being used for recognition. For
a fair comparison, we also made the subset consist of
$P$ subjects for the alignment algorithm of Wagner et al.
[3].

3) The third one also automatically aligns probe face images 
using the holistic alignment algorithm of Wagner et al. 
[3]. The same pruning scheme was used as for the second 
alternative. However, after alignment, a part-based face 
recognition strategy is used where positions of local parts 
are pre-defined and fixed relative to the global face. This 
alternative is termed as “holistic+parted”. 

Our CPA method will be occasionally referred to as 
“mCPA+parted” in accordance with the names of these baselines. 
The chosen subject number for pruning the gallery was set to 
P = 20. 

1) Evaluation on the Multi-PIE Dataset: The CMU Multi-PIE
[32] is the largest publicly available dataset suitable for
testing our CPA method. It contains face images of 337 subjects
captured over a time span of about 5 months. These face
images are organized into 4 sessions according to their capture
time. There are 200 to 250 subjects present in each session,
where every subject is imaged from 15 different viewpoints with
the flashlight varying in 18 different directions. Face images
in Multi-PIE cover 6 well-controlled facial expressions,
including the neutral one and 5 non-neutral ones.

In our experiments, we used face images of 7 different
illuminations (illumination id: {0, 1, 7, 13, 14, 16, 18}) in
the training and gallery sets, and those of a novel illumination
(illumination id: 10) in the probe set. We used 9 subjects
(subject id: {267, 272, 273, 274, 276, 277, 278, 286, 289}),
who appear in all of the 2nd, 3rd, and 4th sessions, for training of the
tree-structured shape model. The 249 subjects that appear in
Session 1 (subject id: 1 to 249) were used for testing. Their
face images in Session 1 constituted the gallery set, and those
in the other sessions were the probes. Table IIa gives more
specifications on the probes.

For experiments of face recognition across pose on the Multi-PIE
dataset, we used the neutral-expression face images of 5
viewpoints of yaw angles at 0°, ±15°, and ±30°. For those
across expression, we used face images of all the 5 available
non-neutral expressions under the frontal viewpoint. We built
a 10-component mCPA model as shown in Fig. 7 and Fig. 8.
Unless otherwise mentioned, these experiment settings were
also used in Sections VI-B, VI-C, and VI-D.

Table IIb reports recognition results of different alternative
methods on face images with varying degrees of pose change.
Compared with “manual+holistic”, “holistic+holistic” gives
improved performance, which suggests that automatic alignment
can improve face registration accuracy and consequently
help face recognition across pose, even when the alignment is
holistic. The performance of “holistic+parted” is
unstable, which indicates that simple part-based recognition
without part alignment is not a feasible approach. Our CPA method
significantly outperforms all these baselines. For the 4
non-frontal viewpoints, CPA improves over the best alternative
method “holistic+holistic” by approximately 15% ~ 40% in
terms of recognition rate.
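
The margins quoted above can be checked directly against Table IIb. The short Python sketch below is only an illustration: the numbers are copied from the table, while the function and variable names are hypothetical and not part of the CPA implementation.

```python
# Recognition rates (%) copied from Table IIb for the four non-frontal viewpoints.
poses = ["-30", "-15", "+15", "+30"]
holistic_holistic = [13.45, 60.64, 80.52, 32.73]
mcpa_parted = [54.62, 93.78, 95.18, 67.07]


def absolute_gains(ours, baseline):
    """Return per-pose gains in percentage points, rounded to two decimals."""
    return [round(o - b, 2) for o, b in zip(ours, baseline)]


gains = absolute_gains(mcpa_parted, holistic_holistic)
for pose, gain in zip(poses, gains):
    print(f"{pose} deg: +{gain:.2f} points")
```

The smallest gain occurs at +15° and the largest at -30°, which matches the approximate 15% ~ 40% range stated in the text.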


Table II: Face recognition across pose and expression with
different illumination on the first 249 subjects of Multi-PIE
dataset. Gallery illuminations: {0, 1, 7, 13, 14, 16, 18}. Probe
illumination: 10.


(a) Glossary for expressions of the probes.

Expression id   E.0        E.1        E.2      E.3     E.4       E.5
Expression      Neutral    Surprise   Squint   Smile   Disgust   Scream
Recording       S[234]R1   S2R2       S2R3     S3R2    S3R3      S4R3

Remark: SaRb - Recording number b in Session a.


(b) Recognition rates (%) for across-pose and neutral-expression (E.0)
settings.

Align.     Recog.     -30°     -15°     0°       +15°     +30°
                      13_0     14_0     05_1     05_0     04_1
manual     holistic   8.03     49.40    91.37    58.03    8.84
holistic   holistic   13.45    60.64    92.97    80.52    32.73
holistic   parted     14.86    61.85    94.78    62.85    13.25
mCPA       parted     54.62    93.78    99.60    95.18    67.07


(c) Recognition rates (%) for across-expression and frontal-pose (05_1)
settings.

Align.     Recog.     E.1      E.2      E.3      E.4      E.5
manual     holistic   44.85    75.76    64.78    49.69    28.74
holistic   holistic   67.88    81.21    68.55    65.41    36.78
holistic   parted     67.88    81.21    79.25    61.01    43.10
mCPA       parted     84.85    93.33    89.94    85.53    58.62


An interesting observation from Table IIb is that our CPA
method can almost perfectly perform face recognition under the
across-session setting of Multi-PIE, where probe face images
of frontal viewpoint and neutral expression are from Sessions
2, 3, and 4 of Multi-PIE and gallery face images are from Session
1. Under this setting, our method gives a 99.60% recognition
rate, about 5% higher than that of “holistic+holistic”, i.e.,
the method in [3]. To the best of our knowledge, this is the
best publicly known result under this setting. Note that the
challenges of face recognition across sessions are mostly due
to appearance changes of human faces after a certain period
of time, since illumination variation should ideally have been
compensated by the multiple images of varying illuminations in
the gallery. Nevertheless, our results suggest that by modeling
the human face as a piece-wise planar surface and using part-wise
alignment, the problem of face recognition across sessions can
to a large extent be overcome.

Table IIc reports recognition results on frontal-view face
images of varying expressions. The relative performance of the
different alternative methods is very similar to that reported in
Table IIb for face recognition across pose. Our CPA method
outperforms the second best “holistic+holistic” method by 12%
~ 21% in terms of recognition rate.

2) Evaluation on the MUCT Dataset: In practical face
recognition scenarios, illumination usually shows various levels
of strength, and people often present near-neutral expressions.
The MUCT dataset is a good simulation of such scenarios,
where face images of natural expressions like frowns and minor
smiles are captured under frontal lighting of 3 strength levels.
We used the first 10 subjects (subject id: 0 - 9) of MUCT for

Figure 7: The mCPA components fitted to faces with different poses:
(a) -30°, (b) -15°, (c) 0°, (d) +15°, (e) +30°. Refer to the caption
of Fig. 3 for explanation.

Figure 8: The mCPA components fitted to faces with non-neutral
expressions: (a) Surprise, (b) Squint, (c) Smile, (d) Disgust,
(e) Scream. Refer to the caption of Fig. 3 for explanation.


Table III: Recognition rates (%) for across-pose settings on the
MUCT dataset.

                      Yaw Only           Pitch Only
Align.     Recog.     +20°     +38°      +21°     -22°
                      b        c         d        e
manual     holistic   99.17    55.89     95.15    99.58
holistic   holistic   99.86    93.20     98.61    99.58
holistic   parted     99.17    69.21     81.14    98.75
mCPA       parted     100      99.86     99.86    100


training the tree-structured shape model, and the remaining 265 for
testing. Frontal-view face images (labeled as “a”) were used
as gallery, and those of the 4 available non-frontal viewpoints
(labeled as “b”, “c”, “d”, and “e”) were used as probes.

Table III reports recognition results of different alternative
methods. Consistent with the results reported in Section VI-A1,
our CPA method outperforms all the other 3 alternatives. In
fact, it performs almost perfectly for all the 4 degrees of pose
change.

B. Effectiveness of Part-based Recognition in CPA method 

The CPA method integrates part alignment with part-based
recognition in a cohesive way. Part-based face recognition fuses
weaker predictions from individual parts to make a stronger
final decision, where a variety of methods can be used for
recognition of individual parts. In this section, we investigate
the varying discriminative power of individual parts, and also
how existing representative face recognition methods perform
for part recognition in CPA. We also present experiments
to show the efficacy of the proposed pruning scheme in
Algorithm 2. These investigations were conducted on the
Multi-PIE dataset under the setting of face recognition across pose
with different illumination for the gallery and probe (the setting
in Section VI-A1 for producing Table IIb).

1) Part Discriminativeness: Fig. 9 reports results obtained
by recognizing each of the 21 parts that are jointly aligned
by our CPA method. The recognition rates on most of the
21 parts are around 60% ~ 90% for the frontal pose and
40% ~ 80% for the ±15° poses. These results suggest that
the discriminativeness of individual parts is good, but not
strong enough for high-accuracy recognition on its own.
By fusing the predictions from individual parts, a high-accuracy
final recognition (“ALL” in Fig. 9) can be achieved for pose
changes within -15° ~ +15° yaw angles. Even for a larger
pose change of ±30° yaw angles, our CPA method performs fairly
well by fusing predictions from aligned individual parts,
although recognition rates for most individual parts are below
20%. This again demonstrates the efficacy of our proposed
CPA method that integrates part-based recognition with part
alignment.
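
The fusion step can be illustrated with a minimal Python sketch. It assumes plain majority voting over the per-part predicted labels, with unavailable parts (the “x” entries in Fig. 9) represented as `None`; the function name and the tie-breaking behavior are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter


def fuse_by_voting(part_labels):
    """Majority vote over per-part predictions; None marks an unavailable part."""
    votes = Counter(label for label in part_labels if label is not None)
    if not votes:
        return None  # no part produced a prediction
    # most_common orders by count; equal counts keep first-encountered order.
    return votes.most_common(1)[0][0]


# Hypothetical predictions of six parts for one probe (subject ids).
print(fuse_by_voting([3, 3, 7, None, 3, 7]))
```

Even when each part is only moderately reliable, the vote is correct whenever the true subject collects more votes than any other single subject, which is why the fused result can be much stronger than any individual part.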

2) Integrated with Different Recognition Methods: Recognition
of aligned individual parts in CPA is realized by
off-the-shelf face recognition methods. Representatives of
these methods include Nearest Subspace (NS) [2], Linear
Discriminant Analysis (LDA) [43], and Local Binary Pattern

Figure 9: Accuracy of recognizing faces by single parts on Multi-PIE (first 249 subjects). Parts are aligned by CPA, but used
individually for recognition. Unavailable parts are referred to as “x”. “ALL” means voting the predicted labels of all the parts.


Table IV: Integrated with different recognition methods on
Multi-PIE (first 249 subjects)


(a) Illumination settings

Strategy               Gallery illum.            Probe illum.
DI: Different illum.   0, 1, 7, 13, 14, 16, 18   10
SI: Same illum.        7                         7


(b) Recognition rates (%) for across-pose and neutral-expression settings

Recog.      Align.     -30°     -15°     0°       +15°     +30°
                       13_0     14_0     05_1     05_0     04_1
NS (DI)     holistic   8.84     44.38    76.31    60.44    20.68
            mCPA       43.98    85.94    98.19    88.76    65.26
LDA (DI)    holistic   9.64     42.57    84.94    63.45    16.67
            mCPA       26.51    73.90    98.80    77.31    28.57
LBP (DI)    holistic   34.34    79.52    95.18    88.35    39.76
            mCPA       59.44    86.95    97.39    87.55    65.06
LBP (SI)    holistic   57.03    92.77    97.39    92.97    60.44
            mCPA       83.53    96.99    98.80    95.38    76.91


(LBP) [4! ]. In this section, we investigate how these different
choices of recognition methods perform when integrated
into our CPA method. Here we compared the CPA method
(“mCPA+parted”) only with “holistic+holistic”.

For LDA, we learned its projection matrix after pruning
candidate subjects and aligning gallery images, both of which
are necessary steps for “holistic+holistic” and our CPA method.
For LBP, we also extracted LBP features at the same stage,
i.e., after the preparation of aligned gallery subsets. All the
experimental settings were the same as those used in preceding
sections, except for replacing the previously used SRC with NS,
LDA, or LBP. Considering the promise of LBP for one-shot
face recognition, where only a single image per subject is
available in the gallery, we also conducted experiments under this
one-shot setting, where probe and gallery face images are of
the same illumination. Table IVa summarizes the illumination
settings of experiments reported in this section.

Recognition rates of CPA with integration of different
methods are reported in Table IVb. Table IVb shows that, for NS,
LDA, and LBP alike, our proposed CPA for part-based
alignment is superior to holistic face alignment in almost all
settings; the only exception is LBP under the DI setting at +15°,
where holistic alignment is marginally better (88.35% vs. 87.55%).
This confirms that our proposed CPA helps face recognition by
effectively aligning individual facial parts.


Table V: Impact of the pruning scheme on CPA (first 249
subjects of Multi-PIE): the standard CPA is compared with
the CPA without pruning scheme (the baseline). Top half:
“corrected” errors are those that occurred for the baseline but not
for the standard CPA, and the “introduced” errors are the opposite.
Bottom half: ΔRR = (Standard CPA’s RR) − (Baseline’s RR).

Errors ...    Cnt.     -30°     -15°     0°       +15°     +30°
by pruning    type     13_0     14_0     05_1     05_0     04_1
Corrected     Num.     77       26       0        26       78
              %        15.46    5.22     0.00     4.22     15.66
Introduced    Num.     33       14       1        14       39
              %        6.63     2.81     0.20     2.01     7.83
RR - no pruning (%)    45.78    91.37    99.80    92.97    59.24
ΔRR with orig. (%)     8.84     2.41     -0.20    2.21     7.83

Remarks: RR - recognition rate; orig. - pruning with P = 20.


3) Pruning Efficacy: The pruning scheme in Algorithm 2
certainly leads to a more efficient algorithm. In this section, we
investigate how it impacts the recognition
performance. To this end, we modified Algorithm 2 by simply
removing the pruning scheme, i.e., setting P to the number of
subjects in the gallery. The bottom half of Table V lists recognition
rates of the non-pruning CPA method together with their
differences from those of the original CPA. For the non-frontal
poses, noticeable performance drops occur when the pruning
scheme is removed from CPA. The top half of Table V gives
more evidence on the impact of the pruning scheme. Taking
the non-pruning version as the baseline, we find that, for the
non-frontal poses, the original CPA method corrects roughly twice
as many recognition errors as it introduces, where “correcting”
a recognition error means correctly recognizing an image that
is falsely recognized by the baseline, and “introducing” means
the opposite. Experimental results reported in Table V were
obtained by using SRC for part recognition. All other settings
were the same as those used in preceding sections.
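
The bookkeeping behind Table V can be sketched in a few lines of Python. This is only an illustration under the definitions given above; the function name, argument names, and toy labels are hypothetical and are not taken from the experiments.

```python
def compare_with_baseline(truth, baseline_pred, pruned_pred):
    """Count errors corrected/introduced by pruning and the change in RR (%)."""
    triples = list(zip(truth, baseline_pred, pruned_pred))
    n = len(triples)
    # Corrected: wrong for the baseline (no pruning), right for the standard CPA.
    corrected = sum(1 for t, b, p in triples if b != t and p == t)
    # Introduced: right for the baseline, wrong for the standard CPA.
    introduced = sum(1 for t, b, p in triples if b == t and p != t)
    rr_baseline = 100.0 * sum(1 for t, b, _ in triples if b == t) / n
    rr_pruned = 100.0 * sum(1 for t, _, p in triples if p == t) / n
    return {
        "corrected": corrected,
        "introduced": introduced,
        "delta_rr": rr_pruned - rr_baseline,
    }


# Toy example with four probes (labels are made up).
stats = compare_with_baseline(
    truth=[1, 2, 3, 4],
    baseline_pred=[1, 0, 3, 0],
    pruned_pred=[1, 2, 0, 4],
)
print(stats)
```

By construction, the change in recognition rate equals the difference between the corrected and introduced percentages, which is the relation between the top and bottom halves of Table V.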

C. Recognition with Synthetic Random Block Occlusion 

We report experiments in this section to demonstrate the
robustness of CPA against partial occlusion. As shown in
Fig. 10, we synthesized partially occluded probe face images
by adding block occlusion at random positions of a probe
face. The size of occluded blocks varied from 10% to 60% of

Figure 10: Images with synthetic block occlusions of different
sizes (10%, 20%, 30%, 40%, 50%, and 60%). The percentages below
the images indicate the occlusion ratio with respect to the
regions enclosed with dashed lines.

Figure 11: Recognition rates on Multi-PIE (first 249 subjects)
when synthetic random block occlusions are present.
(a) Recognition rates at different occlusion levels for the
frontal-pose and neutral-expression setting.
(b) Recognition rates for across-pose and neutral-expression
settings with the occlusion ratio of 30%.


the holistic face. Experiment settings were set the same as in 
Section VI-A1. The method [3], i.e. “holistic+holistic”, was 
taken as the baseline. 
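
A minimal sketch of how such probes could be synthesized is given below. The paper does not specify the block shape or its content, so the square block and the constant fill value here are assumptions made purely for illustration, as are the function and parameter names.

```python
import numpy as np


def add_block_occlusion(face, ratio, rng, fill=0):
    """Occlude a random square block covering about `ratio` of the face area."""
    h, w = face.shape[:2]
    # Side length of a square whose area is `ratio` of the face region.
    side = min(int(round(np.sqrt(ratio * h * w))), h, w)
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    occluded = face.copy()  # leave the input image untouched
    occluded[top:top + side, left:left + side] = fill
    return occluded


rng = np.random.default_rng(0)
face = np.ones((100, 100))  # stand-in for a cropped face region
occluded = add_block_occlusion(face, 0.30, rng)
print(f"occluded fraction: {(occluded == 0).mean():.4f}")
```

Because the side length is rounded to whole pixels, the realized occlusion ratio only approximates the requested one.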

Fig. 11a reports recognition rates on frontal-view face images
at different occlusion ratios, and Fig. 11b reports those of
varying viewpoints at the occlusion ratio of 30%. Compared
with [3], Fig. 11a and Fig. 11b show that our CPA method is
more robust against partial occlusion. When a small portion
(10%) of a face image is occluded, the recognition rate of [3] drops
below 90%. In contrast, recognition rates of CPA remain above
90% when the occlusion ratio increases from 0% to 30%. For
the 30% occlusion ratio, the recognition rate of CPA is more than
45% higher than that of [3] for pose changes within ±15°.

D. Comparison with the State-of-the-art 

In this section, we used the Multi-PIE dataset to compare
CPA with the state-of-the-art Morphable Displacement Field
(MDF) method [34] for face recognition across pose. MDF
achieves good performance by learning an MDF model from
a large number of 3D face shapes, while training of CPA only
requires 2D face images of a few subjects.

Like our CPA method, MDF can also be equipped with
different features and classifiers for recognition. In order to
fairly compare the two methods, in this paper, we used the same
LDA classifier for both methods⁶. While LDA models were
learned from the pruned gallery in the CPA method, a single LDA
model was learned from the entire gallery set in advance for the
MDF method, as it does not prune the gallery during recognition.

We compare the CPA and MDF methods using probe face
images that are under different pose and illumination conditions
from those of gallery images. The experiment settings were
largely the same as those of face recognition across pose with
different illumination used in Section VI-A1, except that:

1) In order to be consistent with the experimental protocol in
the work of Li et al. [34], we used the last 137 subjects
instead of the first 249 ones in Multi-PIE for testing.
More specifically, for each subject, face images in the
session where he/she first appears were included in the
gallery set, and those in the other sessions were in the
probe set.

2) We learned the tree-structured shape model in CPA using
face images of 9 other subjects (subject id: {038, 040,
041, 042, 043, 044, 046, 047, 048}), who appear in all
the four sessions of Multi-PIE.

3) We set P = 10 for the pruning scheme used in
Algorithm 2.

Table VIa reports recognition rates of CPA and MDF on
neutral-expression face images of varying degrees of pose
change, where both fully automatic and semi-automatic (manual
initialization with known pose) experiments are reported. For
the case of semi-automatic experiments using LDA as the
classifier, our CPA method outperforms MDF when the degrees
of pose change are within ±15°. When the degrees of pose
change increase to ±30°, CPA performs worse than MDF. The
performance inconsistency between smaller and larger degrees
of pose change is in fact determined by the nature
of the two methods. CPA is based on 2D similarity transforms
of individual facial parts, whose training only requires face
images of a few subjects, while a large number of 3D face
shapes are necessary for learning an MDF model. By learning
more generic knowledge of 3D face shapes, MDF is able to
better cope with face recognition across a larger degree of pose
change. However, for the range between -15° and +15°,
which is more often encountered in practical scenarios such
as access control, and in which face recognition is also more
reliable, CPA gives more accurate recognition results than MDF
does. In addition, our CPA method with SRC gives similarly
better results in this range of pose change.

We realize a fully automatic CPA method by initializing our 
algorithm using Viola and Jones [41]’s face detector, followed 
by a coarse and holistic alignment using the method of Wagner 
et al. [3]. For knowledge of face pose used in the mCPA model, 

6 Experimental results of MDF were produced by Li et al. [34]. As SRC
was not implemented in their experiment pipeline, we thank them for their
kind help to run MDF with the alternative LDA classifier.

Table VI: Face recognition across pose and expression with
different illumination on the last 137 subjects of Multi-PIE.
Gallery illuminations: {0, 1, 7, 13, 14, 16, 18}. Probe
illumination: 10.


(a) Recognition rates (%) of MDF and CPA when integrated with
LDA. Initialization is manually done by fitting eye-corner
annotations, and pose is known in advance.

Alignment   -30°     -15°     0°       +15°     +30°
            13_0     14_0     05_1     05_0     04_1
MDF         68.71    82.21    /        80.37    74.85
mCPA        38.65    85.28    98.77    86.50    47.24


shape model. This objective is formulated as a norm minimization
problem regularized by the graph likelihoods, which can
be efficiently solved by an alternating optimization method.
CPA can easily incorporate many existing face recognition
methods for part-based recognition. Extensive experiments show
the efficacy of CPA in handling illumination, pose, and/or
expression changes when integrated with a recognition method
robust to illumination changes. In future research, we are
interested in applying/adapting CPA to other computer
vision applications.


(b) Recognition rates (%) of the CPA method integrated with SRC for
different initialization and pose estimation schemes.

Initialization   Pose      -30°     -15°     0°       +15°     +30°
                           13_0     14_0     05_1     05_0     04_1
manual           known     53.37    87.12    98.16    87.73    76.07
auto*            known     32.52    82.82    99.39    94.48    57.06
auto*            auto**    33.13    80.98    99.39    94.48    57.06

* face detector followed by holistic alignment; ** pose estimator.


we consider two situations where poses of probe faces are either
known in advance or estimated by the method of Zhu and Ramanan
[25]. The accuracy rate of Zhu and Ramanan’s pose estimator is
96.56% under our experiment settings⁷.

Table VIb reports recognition results of fully automatic CPA.
Fully automatic CPA (the last two rows) performs comparably
with the semi-automatic one (the first row) when pose changes
of probe face images are within ±15°. Note that the fully
automatic alignment works from coarse to fine granularity,
i.e., it sequentially uses Viola and Jones [41]’s face detector, the
method of Wagner et al. [3] for a holistic alignment, and
part-based alignment by CPA. The experimental results show that
this coarse-to-fine strategy works well for reasonable degrees
of pose change (e.g., within ±15°). For larger degrees of
pose change, the performance of holistic alignment by [3] drops.
Consequently, our method loses the chance of correcting its
alignment failure. In addition, the last two rows of Table VIb
also show that the mCPA model performs equally well when
poses of probe faces are either given or automatically estimated.
This confirms the effectiveness of our proposed method for
automatic part-based face alignment.

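VII. Conclusion

Appendix A
Linear System in (15)

Let $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{d \times m}$. Then (15) is the same as the
system of linear equations

$$G^{(i)} \Delta\nu^{(i)} + \eta \left( \frac{\partial g(\nu + \Delta\nu, Z)}{\partial \Delta\nu^{(i)}} \right)^{T} = q_i, \tag{25}$$

for $i = 1, 2, \ldots, m$, which can be written in a more standard
form

$$\begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1m} \\ W_{21} & W_{22} & \cdots & W_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ W_{m1} & W_{m2} & \cdots & W_{mm} \end{pmatrix} \begin{pmatrix} \Delta\nu^{(1)} \\ \Delta\nu^{(2)} \\ \vdots \\ \Delta\nu^{(m)} \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}, \tag{26}$$

where $c_i \in \mathbb{R}^{d}$ is a column vector, and $W_{ij} \in \mathbb{R}^{d \times d}$. Recall
that $\mathcal{E}$ is the edge set of the tree-structured model. In particular,
if $i$ is the parent of $j$, it holds $(i, j) \in \mathcal{E}$. Now, we can find
that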


Ci = qi+v[ A (i> (/*« - uf> ) - (m 0> - ^ ) 

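for $i = 1, 2, \ldots, m$, and

$$W_{ij} = \begin{cases} G^{(i)} + \eta \sum_{l \in \{i\} \cup \{l' : (i, l') \in \mathcal{E}\}} \Lambda^{(l)}, & j = i, \\ -\eta \Lambda^{(j)}, & (i, j) \in \mathcal{E}, \\ -\eta \Lambda^{(i)}, & (j, i) \in \mathcal{E}, \\ 0, & \text{otherwise}, \end{cases}$$

for $i, j = 1, 2, \ldots, m$.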


In this paper we propose a method termed CPA for face
recognition across pose and expression, which benefits
from pixel-wise accurate alignment. The CPA model consists
of appearance evidence of each part and a tree-structured shape
model for constraining part deformation, both of which can
be automatically learned from training images. To align a
probe image, we fit its parts to the appearance evidence with
consideration of the constraint from the learned tree-structured

7 It seems that Zhu and Ramanan’s method [25] can work very well only when
the training and test images are of the same illumination. To handle the
illumination variation in our experiments, we normalized image illumination
by the non-local means (NLM) based method [46] before pose estimation.


Appendix B 

Solving Transformations in (16) with Fixed Parts 

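$\mathbb{G}$ is the 2-D similarity group parametrized in $\mathbb{R}^{d}$ ($d = 4$) as it
is defined in (1). Given initial values of $\sigma \in \mathbb{G}$ and $\{\nu^{(i)}\}_{i=1}^{m} \in \mathbb{G}$,
(16) seeks the minimum of $g(\nu, Z)$ while keeping the
value of $\sigma \circ \nu^{(i)}$ unchanged. Let $\nu = [\nu^{(1)}, \nu^{(2)}, \ldots, \nu^{(m)}] \in \mathbb{R}^{d \times m}$,
and $\Delta\nu = [\Delta\nu^{(1)}, \Delta\nu^{(2)}, \ldots, \Delta\nu^{(m)}] \in \mathbb{R}^{d \times m}$. We
reformulate (16) as

$$\min_{\Delta\nu, \Delta\sigma} \; g(\nu + \Delta\nu, Z), \tag{27}$$

$$\text{s.t.} \quad (\sigma + \Delta\sigma) \circ (\nu^{(i)} + \Delta\nu^{(i)}) = \sigma \circ \nu^{(i)}. \tag{28}$$

After optimizing it, $\sigma$ is updated to $\sigma + \Delta\sigma$, and $\nu$ is updated
to $\nu + \Delta\nu$.
Let

$$\nu^{(i)} = (s^{(i)}, \theta^{(i)}, t_u^{(i)}, t_v^{(i)}),$$

$$\Delta\sigma = (\delta s^{*}, \delta\theta^{*}, \delta t_u^{*}, \delta t_v^{*}),$$

where $s^{\cdot}, \theta^{\cdot}, t_u^{\cdot}, t_v^{\cdot}$ (“$\cdot$” for “$(i)$” and “$*$”) respectively denote
the scale, rotation, horizontal translation, and vertical translation
of a 2-D similarity transformation. (28) is equivalent to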

/ -fe* \ 

—Art* 

Ai/<*> = 


— 66 * 

—tu^ +/«+/« 

-4 i} + r v + /i i} ) 


(29) 


where,
and,
1 m
and,
$\cos(\delta\theta^{*} + \theta^{*})$
For $i$ satisfying $(j, i) \in \mathcal{E}$, $j \neq 0$ ($i$ is not directly linked with
the root), it holds
$s^{(i)} - s^{(j)}$
Substituting (29) into the objective function in (27), we obtain
the unconstrained equivalent of the original problem, i.e.,
$\min_{\Delta\sigma} g(\nu + \Delta\nu, Z)$, where $\Delta\nu$ is expanded as in (29). We
solve this problem by gradient descent. Now, we need to find
the gradient of $g(\nu + \Delta\nu, Z)$ with respect to $\Delta\sigma$.

Recall that $\mathcal{E}$ is the tree edge set. For $i = 1, 2, \ldots, m$, let $j$
satisfy $(j, i) \in \mathcal{E}$ ($j$ is the parent of $i$), and

$$\varphi^{(i)} = \left( (\nu^{(i)} + \Delta\nu^{(i)}) - (\nu^{(j)} + \Delta\nu^{(j)}) \right) - \mu^{(i)}.$$

We then have


Appendix C 
Solution to (19) 

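We solve (19) by alternately conducting the following
two steps:

1) Fix $\sigma$ and solve $\boldsymbol{\nu}$, $\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$.

2) Fix $\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$ and solve $\sigma$, $\boldsymbol{\nu}$.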

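Step 1: Given fixed $\sigma$, we update $\boldsymbol{\nu}$ by a generalization of
the Gauss-Newton method. To be specific, for a linear update
from $\boldsymbol{\nu}$ to $\boldsymbol{\nu} + \Delta\boldsymbol{\nu}$ ($\Delta\boldsymbol{\nu} \in \mathbb{R}^{d \times m \times n}$), we approximate the
equality constraint in (19) by its first-order Taylor expansion
at $T^{(i)}(\boldsymbol{\nu})$, i.e.,
$D \circ \sigma \circ T^{(i)}(\boldsymbol{\nu} + \Delta\boldsymbol{\nu}) \approx D \circ \sigma \circ T^{(i)}(\boldsymbol{\nu}) + \sum_{k=1}^{n} J_k^{(i)} T_k^{(i)}(\Delta\boldsymbol{\nu}) e_k e_k^{T}$,
where $J_k^{(i)} = \partial \big( d_k \circ (\sigma_k \circ \nu_k^{(i)}) \big) / \partial \nu_k^{(i)}$
is the Jacobian w.r.t. $\nu_k^{(i)}$, and $\{e_k\}_{k=1}^{n}$ denotes the standard
basis of $\mathbb{R}^{n}$. The above linearization leads to the following
problem to optimize $\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$, and $\Delta\boldsymbol{\nu}$: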


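$$\min_{\substack{A^{(i)}, E^{(i)}, \Delta\boldsymbol{\nu} \\ i = 1, 2, \ldots, m}} \; \sum_{i=1}^{m} \left( \|A^{(i)}\|_{*} + \lambda^{(i)} \|E^{(i)}\|_{1} \right) + \eta \sum_{k=1}^{n} g\big(T_k(\boldsymbol{\nu} + \Delta\boldsymbol{\nu}), Z\big) \tag{32}$$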

For i satisfying (0,i) E £ (i is directly linked with the root), 
it holds 


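$$\text{s.t.} \quad D \circ \sigma \circ T^{(i)}(\boldsymbol{\nu}) + \sum_{k=1}^{n} J_k^{(i)} T_k^{(i)}(\Delta\boldsymbol{\nu}) e_k e_k^{T} = A^{(i)} + E^{(i)}.$$

We repeatedly solve (32) to update $\boldsymbol{\nu}$ until it converges to
a local minimum, which gives the solution to the original
problem (19).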

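We solve the convex problem (32) by adapting the ALM.
Let

$$h\big(A^{(i)}, E^{(i)}, T^{(i)}(\Delta\boldsymbol{\nu})\big) = D \circ \sigma \circ T^{(i)}(\boldsymbol{\nu}) + \sum_{k=1}^{n} J_k^{(i)} T_k^{(i)}(\Delta\boldsymbol{\nu}) e_k e_k^{T} - A^{(i)} - E^{(i)}.$$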

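The augmented Lagrange function is written as

$$L_{\beta}\big(\{A^{(i)}\}_{i=1}^{m}, \{E^{(i)}\}_{i=1}^{m}, \Delta\boldsymbol{\nu}, \{\Gamma^{(i)}\}_{i=1}^{m}\big) = \sum_{i=1}^{m} \Big( \|A^{(i)}\|_{*} + \lambda^{(i)} \|E^{(i)}\|_{1} + \big\langle \Gamma^{(i)}, h(A^{(i)}, E^{(i)}, T^{(i)}(\Delta\boldsymbol{\nu})) \big\rangle + \frac{\beta}{2} \big\| h(A^{(i)}, E^{(i)}, T^{(i)}(\Delta\boldsymbol{\nu})) \big\|_{F}^{2} \Big) + \eta \sum_{k=1}^{n} g\big(T_k(\boldsymbol{\nu} + \Delta\boldsymbol{\nu}), Z\big), \tag{33}$$

where $\{\Gamma^{(i)}\}_{i=1}^{m}$ are the Lagrange multiplier matrices, and the
matrix inner product $\langle \cdot, \cdot \rangle$ is defined as $\langle A, B \rangle = \mathrm{trace}(A^{T} B)$.
Given initial $\{\Gamma^{(i)}\}_{i=1}^{m}$, ALM iteratively and alternately updates
$\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$, $\Delta\boldsymbol{\nu}$, and $\{\Gamma^{(i)}\}_{i=1}^{m}$ by

1) $$\big(\{A^{(i)}\}_{i=1}^{m}, \{E^{(i)}\}_{i=1}^{m}, \Delta\boldsymbol{\nu}\big) \leftarrow \arg\min L_{\beta_l}\big(\{A^{(i)}\}_{i=1}^{m}, \{E^{(i)}\}_{i=1}^{m}, \Delta\boldsymbol{\nu}, \{\Gamma^{(i)}\}_{i=1}^{m}\big); \tag{34}$$

2) $\Gamma^{(i)} \leftarrow \Gamma^{(i)} + \beta_l \cdot h\big(A^{(i)}, E^{(i)}, T^{(i)}(\Delta\boldsymbol{\nu})\big)$,
for $i = 1, 2, \ldots, m$;

where $l$ is the iteration number, and $\{\beta_l\}_{l=1,2,\ldots}$ is a sequence
increasing to a sufficiently large value.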

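Directly solving (34) w.r.t. all the unknown variables is still
a difficult problem. Instead, we alternately update them in
three groups, i.e., $\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$, and $\Delta\boldsymbol{\nu}$, so that each
subproblem associated with any single group of variables has
a closed-form solution. More specifically, for $i = 1, 2, \ldots, m$,
we update $A^{(i)}$ and $E^{(i)}$ sequentially by

$$R^{(i)} \leftarrow D \circ \sigma \circ T^{(i)}(\boldsymbol{\nu}) + (1/\beta_l)\, \Gamma^{(i)},$$

$$(U, \Sigma, V) \leftarrow \mathrm{svd}\Big( R^{(i)} + \sum_{k=1}^{n} J_k^{(i)} T_k^{(i)}(\Delta\boldsymbol{\nu}) e_k e_k^{T} - E^{(i)} \Big),$$

$$A^{(i)} \leftarrow U \, \mathcal{S}_{1/\beta_l}(\Sigma) \, V^{T},$$

$$E^{(i)} \leftarrow \mathcal{S}_{\lambda^{(i)}/\beta_l}\Big( R^{(i)} + \sum_{k=1}^{n} J_k^{(i)} T_k^{(i)}(\Delta\boldsymbol{\nu}) e_k e_k^{T} - A^{(i)} \Big),$$

where $R^{(i)}$ is an auxiliary variable used for notational convenience,
and $\mathcal{S}_{a}(\cdot)$ is the soft-thresholding function defined in (12).
Then, for $k = 1, 2, \ldots, n$, we find the optimum $T_k(\Delta\boldsymbol{\nu})$ by
solving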
solving 


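$$0 = \frac{\partial L_{\beta_l}\big(\{A^{(i)}\}_{i=1}^{m}, \{E^{(i)}\}_{i=1}^{m}, \Delta\boldsymbol{\nu}\big)}{\partial \Delta\nu_k^{(i)}} = \beta_l \Big( \big(R^{(i)} - A^{(i)} - E^{(i)}\big) e_k + J_k^{(i)} \Delta\nu_k^{(i)} \Big)^{T} J_k^{(i)} + \eta \, \frac{\partial g\big(T_k(\boldsymbol{\nu} + \Delta\boldsymbol{\nu}), Z\big)}{\partial \Delta\nu_k^{(i)}} \tag{35}$$

for $i = 1, 2, \ldots, m$. The $m$ equations form a sparse linear system
of the form in (15) (expanded form in Appendix A), whose
parameters are $\{G^{(i)}\}_{i=1}^{m}$ and $Q$. Here, $G^{(i)} = \beta_l J_k^{(i)T} J_k^{(i)}$, and
the $i$-th column of $Q$ is $\beta_l J_k^{(i)T} \big(A^{(i)} + E^{(i)} - R^{(i)}\big) e_k$. Still, we use
the inexact scheme for ALM, in which $\{A^{(i)}\}_{i=1}^{m}$, $\{E^{(i)}\}_{i=1}^{m}$,
and $\Delta\boldsymbol{\nu}$ are alternately updated only once in each iteration
of ALM.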
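
The two closed-form updates used in Step 1 rely on shrinkage operators. The NumPy sketch below assumes that the soft-thresholding function of (12) is the usual elementwise shrinkage, and that the low-rank update applies it to the singular values (singular value thresholding); the function names are illustrative and not taken from the paper.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def singular_value_threshold(m, tau):
    """Shrink the singular values of m by tau (proximal map of the nuclear norm)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Scaling the columns of u by the shrunk singular values rebuilds the matrix.
    return (u * soft_threshold(s, tau)) @ vt


# Small numeric checks of both operators.
print(soft_threshold(np.array([-3.0, 0.5, 2.0]), 1.0))
print(singular_value_threshold(np.diag([3.0, 1.0]), 2.0))
```

In an ALM iteration of the kind described above, the first operator would serve the sparse error term and the second the low-rank term, with thresholds that shrink as the penalty parameter grows.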

Step 2: Given fixed $\{A^{(i)}\}_{i=1}^{m}$ and $\{E^{(i)}\}_{i=1}^{m}$, $\sigma_k \circ \nu_k^{(i)}$ can be
taken as constant for any possible $i$ and $k$. For every $k =
1, 2, \ldots, n$, we use the same technique as in Section III-B2 to
solve $\sigma_k$ and $T_k(\boldsymbol{\nu})$. More specifically, denote $Q_k^{(i)} = \sigma_k \circ \nu_k^{(i)}$,
which should not be changed, and update $\sigma_k$ and $T_k(\boldsymbol{\nu})$ by

$$\min_{T_k(\boldsymbol{\nu}), \sigma_k} \; g\big(T_k(\boldsymbol{\nu}), Z\big) \quad \text{s.t.} \quad \sigma_k \circ \nu_k^{(i)} = Q_k^{(i)}. \tag{36}$$

The solution to this type of problem is presented in Appendix B.
Note that $\sigma$ might be initialized by perfect manual
annotations in practice for part dictionary learning. In this
case, we may fairly take it as a known variable in the original
problem (18), so that the algorithm efficiency can be improved
by removing an alternating loop.

Appendix D 

Maximum a Posteriori with Gaussian-Wishart 
Prior 

The PDF of the Wishart distribution is

$$\mathcal{W}(\Lambda \mid V, r) = \frac{1}{\rho} |\Lambda|^{(r-d-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(\Lambda V^{-1}) \right), \quad \text{s.t.} \quad \rho = 2^{rd/2} |V|^{r/2} \Gamma_d(r/2),$$

where $\Gamma_d(\cdot)$ is the $d$-dimensional Gamma function. The PDF of the
Gaussian-Wishart distribution is

$$\mathcal{G}(\mu, \Lambda \mid u, \kappa, V, r) = \mathcal{N}\big(\mu \mid u, (\kappa\Lambda)^{-1}\big) \cdot \mathcal{W}(\Lambda \mid V, r),$$

where $\mathcal{N}(\cdot \mid \cdot, \cdot)$ denotes the Gaussian PDF defined in (4).
The Gaussian-Wishart distribution is the conjugate prior of
the Gaussian distribution, where “conjugate” means that the
prior and the corresponding posterior follow the same type of
distribution, only with different parameters.

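Given $n$ observations $\{\nu_k\}_{k=1}^{n}$ drawn from $\mathcal{N}(\mu, \Lambda^{-1})$, let
$\mu_{ML}$, $\Lambda_{ML}$ denote the ML estimates of $\mu$, $\Lambda$. Now, taking
the Gaussian-Wishart distribution $\mathcal{G}(u_0, \kappa_0, V_0, r_0)$ as the prior
for $\mu$, $\Lambda$, their posterior is

$$P\big(\mu, \Lambda \mid \{\nu_k\}_{k=1}^{n}, u_0, \kappa_0, V_0, r_0\big) = \mathcal{G}(\mu, \Lambda \mid u_n, \kappa_n, V_n, r_n),$$

where,

$$r_n = r_0 + n,$$
$$\kappa_n = \kappa_0 + n,$$
$$u_n = \frac{\kappa_0 u_0 + n \mu_{ML}}{\kappa_0 + n},$$
$$V_n = \left( V_0^{-1} + n \Lambda_{ML}^{-1} + \frac{\kappa_0 n}{\kappa_0 + n} H \right)^{-1},$$
$$H = (\mu_{ML} - u_0)(\mu_{ML} - u_0)^{T}.$$

By finding the maximum of $\mathcal{G}(\mu, \Lambda \mid u_n, \kappa_n, V_n, r_n)$, we
obtain the MAP estimates of $\mu$, $\Lambda$:

$$\Lambda_{MAP} = (r_0 + n - d) V_n,$$
$$\mu_{MAP} = u_n.$$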

Inversely, taking $n = 0$, we can set $\mathcal{G}(u_0, \kappa_0, V_0, r_0)$ to be
consistent with a specific Gaussian distribution $\mathcal{N}(\mu_0, \Lambda_0^{-1})$, i.e.,

$$u_0 = \mu_0,$$
$$V_0 = \Lambda_0 / (r_0 - d).$$
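
The update rules above can be checked numerically. The NumPy sketch below is a minimal illustration, assuming the observations are stored as rows of an array and that $\Lambda$ denotes a precision matrix; the function names and the `weight` parameter (playing the role of $r_0 = \kappa_0$) are hypothetical.

```python
import numpy as np


def gaussian_wishart_map(samples, u0, kappa0, V0, r0):
    """MAP estimate of (mean, precision) under a Gaussian-Wishart prior."""
    n, d = samples.shape
    if n == 0:
        # Without observations the MAP estimate is the mode of the prior.
        return u0, (r0 - d) * V0
    mu_ml = samples.mean(axis=0)
    centered = samples - mu_ml
    scatter = centered.T @ centered  # equals n * inv(Lambda_ML)
    diff = (mu_ml - u0)[:, None]
    H = diff @ diff.T
    kappa_n = kappa0 + n
    u_n = (kappa0 * u0 + n * mu_ml) / kappa_n
    V_n = np.linalg.inv(np.linalg.inv(V0) + scatter + (kappa0 * n / kappa_n) * H)
    return u_n, (r0 + n - d) * V_n


def prior_from_gaussian(mu0, Lambda0, weight):
    """Prior parameters (u0, kappa0, V0, r0) matching N(mu0, inv(Lambda0))."""
    d = mu0.shape[0]  # requires weight > d
    return mu0, weight, Lambda0 / (weight - d), weight


mu0 = np.array([0.5, -1.0])
Lambda0 = np.array([[2.0, 0.3], [0.3, 1.0]])
u0, kappa0, V0, r0 = prior_from_gaussian(mu0, Lambda0, weight=5.0)

# With no data, the MAP estimate reproduces the Gaussian encoded in the prior.
mu_map0, Lambda_map0 = gaussian_wishart_map(np.empty((0, 2)), u0, kappa0, V0, r0)

# With many samples, the MAP estimate approaches the ML estimate.
rng = np.random.default_rng(0)
x = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], size=5000)
mu_map, Lambda_map = gaussian_wishart_map(x, u0, kappa0, V0, r0)
Lambda_ml = np.linalg.inv(np.cov(x, rowvar=False, bias=True))
```

The first check illustrates the "inverse" construction of the prior, and the second shows how the influence of the prior fades as the number of observations grows relative to the prior weight.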

In addition, note that using the prior $\mathcal{G}(u_0, \kappa_0, V_0, r_0)$ is equivalent
to incorporating $r_0$ (or $\kappa_0$) additional samples drawn from the
Gaussian distribution $\mathcal{N}\big(u_0, ((r_0 - d) V_0)^{-1}\big)$ into the existing
observations, where $r_0$ and $\kappa_0$ should normally be set to the
same value. In view of this, we set $r_0 = \kappa_0 = \beta n$, where $\beta > 0$
determines the weight of the prior w.r.t. the observations.